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Abstract 

Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving 
data, either between levels of a memory hierarchy, or between processors over a network. Communication 
often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek 
algorithms that minimize communication. In [4 lower bounds were presented on the amount of communi- 
cation required for essentially all 0(n'^)-like algorithms for linear algebra, including eigenvalue problems 
and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform 
asymptotically more communication than these lower bounds require. In this paper we present parallel 
and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and 
SVD algorithms that do attain these lower bounds, and analyze their convergence and communication 
costs. 

1 Introduction 

The running time of an algorithm depends on two factors: the number of floating point operations executed 
{arithmetic) and the amount of data moved either between levels of a memory hierarchy in the case of 
sequential computing, or over a network connecting processors in the case of parallel computing {commu- 
nication). The simplest metric of communication is to count the total number of words moved (also called 
the bandwidth cost). Another simple metric is to count the number of messages containing these words (also 
known as the latency cost). On current hardware the cost of moving a single word, or that of sending a 
single message, already greatly exceed the cost of one arithmetic operation, and technology trends indicate 
that this processor-memory gap is growing exponentially over time. So it is of interest to devise algorithms 
that minimize communication, sometimes even at the price of doing more arithmetic. 

In this paper we present sequential and parallel algorithms for finding eigendecompositions and SVDs 
of dense matrices, that do 0{n^) arithmetic operations, are numerically stable, and do asymptotically less 
communication than previous such algorithms. 

In fact, these algorithms attain known communication lower bounds that apply to many O(n^) algorithms 
in dense linear algebra. In the sequential case, when the n-by-n input matrix does not fit in fast memory 
of size M, the number of words moved between fast (small) and slow (large) memory is Q{n^/^/M). In the 
case of P parallel processors, where each processor has room in memory for 1/P-th of the input matrix, 
the number of words moved between one processor and the others is rt{in? /y/P). These lower bounds were 
originally proven for sequential [30 and parallel [32 matrix multiplication, and extended to many other linear 
algebra algorithms in [4 , including the first phase of conventional eigenvalue/SVD algorithms: reduction to 
Hessenberg, tridiagonal and bidiagonal forms. 
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Most of our algorithms, however, do not rely on reduction to these condensed forms; instead they rely 
on explicit QR factorization (which is also subject to these lower bounds). This raises the question as to 
whether there is a communication lower bound independent of algorithm for solving the eigenproblem. We 
provide a partial answer by reducing the QR decomposition of a particular block matrix to computing the 
Schur form of a similar matrix, so that sufficiently general lower bounds on QR decomposition could provide 
lower bounds for computing the Schur form. 

We note that there are communication lower bounds not just for the bandwidth cost, but also for the 
latency cost. Some of our algorithms also attain these bounds. 

In more detail, we present three kinds of communication-minimizing algorithms, randomized spectral 
divide- and- conquer^ eigenvectors from Schur Form^ and successive band reduction. 

Randomized spectral divide- and-conquer applies to eigenvalue problems for regular pencils A — XB^ the 
nonsymmetric and symmetric eigenvalue problem of a single matrix, and the singular value decomposition 
(SVD) of a single matrix. For A — XB it computes the generalized Schur form, and for a single nonsymmetric 
matrix it computes the Schur form. For a symmetric problem or SVD, it computes the full decomposition. 
There is an extensive literature on such divide-and-conquer algorithms: the PRISM project algorithm [2l[6], 
the reduction of the symmetric eigenproblem to matrix multiplication by Yau and Lu [35], the matrix sign 
function algorithm originally formulated in [STJ |40] , and the inverse-free approach originally formulated in 
[Ml HZl |36l |37l EH] , which led to the version [3 [14] from which we start here. Our algorithms will minimize 
communication, both bandwidth cost and latency cost, in both the two- level sequential memory model 
and parallel model (asymptotically, up to polylog factors). Because there exist cache-oblivious sequential 
algorithms for matrix multiplication [24 and QR decomposition [23], we conjecture that it is possible to 
devise cache-oblivious versions of the randomized spectral divide-and-conquer algorithms, thereby minimizing 
communication costs between multiple levels of the memory hierarchy on a sequential machine. 

In particular, we fix a limitation in [3] (also inherited by [14 J which needs to compute a rank-revealing 
factorization of a product of matrices A~^B without explicitly forming either A~^ or A~^B. The previous 
algorithm used column pivoting that depended only on 5, not A; this cannot a priori guarantee a rank- 
revealing decomposition, as the improved version we present in this paper does. In fact, we show here how 
to compute a randomized rank-revealing decomposition of an arbitrary product Af^ • • • A^^ without any 
general products or inverses, which is of independent interest. 

We also improve the old methods ([3], [M]) by providing a "high level" randomized strategy (explaining 
how to choose where to split the spectrum), in addition to the "basic splitting" method (explaining how to 
divide the spectrum once a choice has been made) . We also provide a careful analysis of both "high level" 
and "basic splitting" strategies (Section [3|, and illustrate the latter on a few numerical examples. 

Additionally, unlike previous methods, we show how to deal effectively with clustered eigenvalues (by 
outputting a convex polygon, in the nonsymmetric case, or a small interval, in the symmetric one, where the 
eigenvalues lie), rather than assume that eigenvalues can be spread apart by scaling (as in [35]). 

Given the (generalized) Schur form, our second algorithm computes the eigenvectors, minimizing com- 
munication by using a natural blocking scheme. Again, it works sequentially or in parallel, minimizing 
bandwidth cost and latency cost. 

Though they minimize communication asymptotically, randomized spectral divide-and-conquer algo- 
rithms perform several times as much arithmetic as do conventional algorithms. For the case of regular 
pencils and the nonsymmetric eigenproblem, we know of no communication-minimizing algorithms that also 
do about the same number of floating point operations as conventional algorithms. But for the symmetric 
eigenproblem (or SVD), it is possible to compute just the eigenvalues (or singular values) with very little ex- 
tra arithmetic, or the eigenvectors (or singular vectors) with just 2x the arithmetic cost for sufficiently large 
matrices (or 2.6x for somewhat smaller matrices) while minimizing bandwidth cost in the sequential case. 
Minimizing bandwidth costs of these algorithms in the parallel case and minimizing latency in either case 
are open problems. These algorithms are a special case of the class of successive hand reduction algorithms 
introduced in 18] . 

The rest of this paper is organized as follows. Section |2] discusses randomized rank-revealing decomposi- 
tions. Section [3] uses these decompositions to implement randomized spectral divide-and-conquer algorithms. 
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Section |4] presents lower and upper bounds on communication. Section [5] discusses computing eigenvectors 
from matrices in Schur form. Section |6] discusses successive band reduction. Finally, Section [7| draws con- 
clusions and presents some open problems. 



2 Randomized Rank- Revealing Decompositions 

Let A be an n X n matrix with singular values cri > (72 > . . . > cr^, and assume that there is a "gap" in the 
singular values at level /c, that is, (Ji/(Jk = 0(1), while (Jk/crk+i ^ 1- 

Informally speaking, a decomposition of the form A = URV is called rank revealing if the following 
conditions are fulfilled: 



1) U and V are orthogonal/unitary and R = 

{n — k) X {n — k); 



Rii Ri2 
O R22 



is upper triangular, with Rn k x k and R22 



2) cTminiRii) is a "good" approximation to (at most a factor of a low-degree polynomial in n away 
from it), 

(3) cFmax{R22) is a "good" approximation to a/c+i (at most a factor of a low-degree polynomial in n away 
from it); 

(4) In addition, if | |i^f/i^i2| I2 is small (at most a low-degree polynomial in n), then the rank-revealing 
factorization is called strong (as per [28 ). 

Rank revealing decompositions are used in rank determination [44^, least square computations [13], 
condition estimation [5^, etc., as well as in divide-and-conquer algorithms for eigenproblems. For a good 
survey paper, we recommend [28] . 

In the paper [14], we have proposed a randomized rank revealing factorization algorithm RURV. Given 
a matrix A, the routine computes a decomposition A = U RV with the property that is a rank-revealing 
matrix; the way it does it is by "scrambling" the columns of A via right multiplication by a uniformly 
random orthogonal (or unitary) matrix . The way to obtain such a random matrix (whose described 
distribution over the manifold of unitary /orthogonal matrices is known as Haar) is to start from a matrix 
B of independent, identically distributed normal variables of mean and variance 1, denoted here and 
throughout the paper by A^(0, 1). The orthogonal/unitary matrix V obtained from performing the QR 
algorithm on the matrix B is Haar distributed. 

Performing QR on the resulting matrix AV^ =: A = UR yields two matrices, U (orthogonal or unitary) 
and R (upper triangular), and it is immediate to check that A = URV. 



Algorithm 1 Function [U^R^V] =RURV(74), computes a randomized rank revealing decomposition A = 
URV, with V a Haar matrix. 

1: Generate a random matrix B with i.i.d. A^(0, 1) entries. 

2: [V,R] =QK{B). 

3: A = A-V^. 

4: [U,R] = QR(i). 

5: Output R. 



It was proven in that, with high probability, RURV computes a good rank revealing decomposition 
of A in the case of A real. Specifically, the quality of the rank-revealing decomposition depends on computing 
the asymptotics of /r,n, the smallest singular value of an r x r submatrix of a Haar-distributed orthogonal 
nxn matrix. All the results of [Mj can be extended verbatim to Haar-distributed unitary matrices; however, 
the analysis employed in [H] is not optimal. In [22 , the asymptotically exact scaling and distribution of 
fr,n were computed for all regimes of growth (r, n) (naturally, < r < n). Essentially, the bounds state that 
Y^r(n — r)fr^n converges in law to a given distribution, both for the real and for the complex cases. 
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The result of [IT states that RURV is, with high probabihty, a rank-reveahng factorization. Here we 
strengthen these results to argue that it if in fact a strong rank-revealing factorization, in the sense of 
since with high probability | |i?^2^^i?i2| | will be small. We obtain the following theorem. 

Theorem 2.1. Let A be an n x n matrix with singular values ai, . . . , cr^, cr^+i, . . . , (7^. Let e > be a small 
number. Let R be the matrix produced by the RURV algorithm on A. Assume that n is large and that 
^ = M and that > Cn for some constants M and C > 1, independent of n. 

There exist constants ci, C2, and C3 such that, with probability 1 — 0(e) ^ the following three events occur: 

Vr(n - r) 



I^ii"^^i2||2 < cz^r{n - r) . 



Proof. There are two cases of the problem, r < n/2 and r > n/2. Let V be the Haar matrix used by the 
algorithm. From Theorem 2.4-1 in [27], the singular values of 1/(1; r, 1 : r) when r > n/2 consist of (2r — n) 
I's and the singular values of V{{r + 1) : n, (r + 1) : n). Thus, the case r > n/2 reduces to the case r < n/2. 
The first two relationships follow immediately from [14 and [22]; the third we will prove here. 
We use the following notation. Let A = PY^Q^ = P-diag(I]i, Yi2)'Q^ be the singular value decomposition 
of A, where Si = diag(<Ji, . . . , a^) and E2 = diag(<Jr+i, . . . , cFn)- Let be the random unitary matrix in 
RURV. Then X = Q^V^ has the same distribution as V^, by virtue of the fact that V's distribution is 
uniform over unitary matrices. 
Write 

Xii X12 
X21 X22 



X 



where Xn is r x r, and X22 is {n — r) x (n — r). 
Then 

U^P -YX = R ; 

denote S • X = [Fi, Y2] where Fi is an n x r matrix and Y2 in an (n — r) x n one. Since P is unitary, in 
effect 

R11R12 = ^i^^2 5 

where is the pseudoinverse of Yi, i.e. Y^ = {Y^Yi)~^Y^ . We obtain that 

^1 1"^ ^1 2 = (-^n^^ -^11 + -^21^2-^21) (-^11^1-^12 + -^21^2-^22) • 

Examine now 

Ti := (^XiiT^lXii +X|{^i;2-^2i) XiiT>lXi2 = X-^^ (S^ + {X2iX-^-^)^T>l{X2iX-^-^)) S1X12 . 
Since X12 is a submatrix of a unitary matrix, ||Xi2|| < 1, and thus 

iiTiii < iiXnr^ii/. + Er2(X2ixr/)^si(X2ixr/)ii-i < ^ ■ . ^ \y . (i) 

Given that crmin(-^ii) = 0{{^r{n — r))~^) with high probability and cFr/cFr+i > Cn^ it follows that the 
denominator of the last fraction in ([T]) is 0(1). Therefore, there must exist a constant such that, with 
probability bigger than 1 — e, ||Ti|| < C2,^r{n — r). Note that C3 depends on e. 
The same reasoning applies to the term 

T2 := {XiiTj\Xii +X|]^E2X2i) X2i5]2^22 ; 
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to yield that 

1IT2I1 < i|xri^iniJ. + sr^(X2ixri^)^si(X2ixrii)i|-i||sr'i|.||siii, (2) 

because IIX22II < 1. 

The conditions imposed in the hypothesis ensure that llXfi^ll • ||S^i|| • IIS2II = 0(1), and thus \\T2\\ = 
0(1). 

From ^ and the conclusion follows. □ 

The bounds on crmin(^ii) are as good as any deterministic algorithm would provide (see [28^). However, 
the upper bound on crmax(^22) is much weaker than in corresponding deterministic algorithms. We suspect 
that this may be due to the fact that the methods of analysis are too lax, and that it is not intrinsic to the 
algorithm. 

This theoretical sub-optimality will not affect the performance of the algorithm in practice, as it will be 
used to differentiate between singular values dr+i that are very small (polynomially close to 0) and singular 
values (Jr that are close to 1, and thus far enough apart. 

Similarly to RURV we define the routine RULV, which performs the same kind of computation (and 
obtains a rank revealing decomposition of A), but uses QL instead of QR, and thus obtains a lower triangular 
matrix in the middle, rather than an upper triangular one. 

Given RURV and RULV, we now can give a method to find a randomized rank-revealing factorization 
for a product of matrices and inverses of matrices, without actually computing any of the inverses or matrix 
products. This is a very interesting and useful procedure in itself, but we will also use it in Sections [3] and 
|3.2| in the analysis of a single step of our Divide- and- Conquer algorithms. 

Suppose we wish to find a randomized rank-revealing factorization = URV for the matrix = 
^mi . ^ where Ai, . . . ,Ak are given matrices, and mi, ... , rrik G { — 1,1}, without actually 

computing Mk or any of the inverses. 

Essentially, the method performs RURV or, depending on the power, RULV, on the last matrix of the 
product, and then uses a series of QR/RQ to "propagate" an orthogonal/unitary matrix to the front of 
the product, while computing factor matrices from which (if desired) the upper triangular R matrix can 
be obtained. A similar idea was explored by G.W. Stewart in [45 to perform graded QR; although it was 
suggested that such techniques can be also applied to algorithms hke URV, no randomization was used. 

The algorithm is presented in pseudocode below. 

Lemma 2.2. GRURV ( Generalized Randomized URVy) computes the rank-revealing decomposition Mk = 

^ current -L^X • • • -^/^ ^ • 

Proof. Let us examine the case when k = 2 {k > 2 results immediately through simple induction). 
Let us examine the cases: 

1. m2 = 1. In this case, M2 = A^^Ms; the first RURV yields M2 = A'^'UR2V. 

(a) if mi = 1, M2 = A1UR2V; performing QR on AiU yields M2 = U cur r ent ^1^2^ • 

(b) if mi = -1, M2 = A^^UR2V] performing RQ on Ai yields M2 = UcurrentRi^ R2V . 

2. m2 = -1. In this case, M2 = A'l'^A^^; the first RULV yields M2 = A'^^ULl^^V = A^^^'UR^W. 

(a) if mi = 1, M2 = AiUL^^V = AiUR^^V; performing QR on AiU yields M2 = UcurrentRiR^^V . 

(b) finally, if ms = -1, M2 = A^^UL^^V = A-^UR^^V; performing RQ on U^Ai yields M2 = 

UcurrentR\ R2 ^ ' 

Note now that in all cases Mk = UcurrentRT^ • • • R]^^V. Since the inverse of an upper triangular matrix 
is upper triangular, and since the product of two upper triangular matrices is upper triangular, it follows 
that R := R^^ . . . R^^ is upper triangular. Thus, we have obtained a rank-revealing decomposition of Mk] 
the same rank-revealing decomposition as if we have performed QR on MkV^ . □ 
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Algorithm 2 Function U =GRURV(/c; Ai, . . . , A/^; mi, . . . , m/c), computes the U factor in a randomized 
rank revealing decomposition of the product matrix = ■ A^'^ • . . . A^^ , where mi, . . . , ruk G { — 1,1}. 

if ruk = 1, then 

[/7,i?fc,F] =RURV(Afc) 
else 

[/7,Lfe,F] =RULV(Af) 

Rk 
end if 

Ucurrent — U 

for i = k — 1 downto 1 do 
if m^ = 1, then 

[U^Ri] = QIl{Ai • Ucurrent) 
Ucurrent — U 

else 

[U,Ri] = RQ(/7c^^^g^^ • 
— rr^ 

^current — '-^ 

end if 
end for 

return Ucurrent, optiouahy V,Ri,...,Rk 



By using the stable QR and RQ algorithms described in [14], as well as RURV, we can guarantee the 
following result. 

Theorem 2.3. The result of the algorithm GRURV is the same as the result of RURV on the (explicitly 
formed) matrix M^; therefore, given a large gap in the singular values of Mk, cr^+i <C cr^ ~ cri^ the algorithm 
GRURV produces a rank-revealing decomposition with high probability. 

Note that we may also return the matrices i?i, . . . , i?/c, from which the factor R can later be reassembled, 
if desired (our algorithms only need U current , not R = R^^ . . . R^^ ^ and so we will not compute it). 

3 Divide-and-Conquer: Four Versions 

As mentioned in the Introduction, the idea for the divide-and-conquer algorithm we propose has been first 
introduced in [36 ; it was subsequently made stable in [3^, and has then been modified in [TT to include a 
randomized rank-revealing decomposition, in order to minimize the complexity of linear algebra (reducing 
it to the complexity of matrix multiplication, while keeping algorithms stable). The version of RURV 
presented in this paper is the only complete one, through the introduction of GRURV; also, the new 
analysis shows that it has stronger rank-revealing properties than previously shown. 

Since in [14 we only sketched the merger of the rank-revealing decomposition and the eigenvalue/singular 
value decomposition algorithms, we present the full version in this section, as it will be needed to analyze 
communication. We give the most general version RGNEP (which solves the generalized, non-symmetric 
eigenvalue problem), as well as versions RNEP (for the non-symmetric eigenvalue problem), RSEP (for 
the symmetric eigenvalue problem), and RSVD (for the singular value problem). The acronyms here are 
the same used in LAPACK, with the exception of the first letter "R", which stands for "randomized". 

In this section and throughout the rest of the paper, QR, RQ, QL, and LQ represent functions/routines 
computing the corresponding factorizations. Although these factorizations are well-known, the implementa- 
tions of the routines are not the "standard" ones, but the communication-avoiding ones described in [4 . We 
only consider the latter, for the purpose of minimizing communication. 
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3.1 Basic splitting algorithms (one step) 



Let V be the interior of the unit disk. Let A, B be two n x n matrices with the property that the pencil 
(A, B) has no eigenvalues on V. The algorithm below "splits" the eigenvalues of the pencil {A, B) into two 
sets; the ones inside V and the ones outside V. 
The method is as follows: 

1. Compute (/ + {A~^BY^)~^ implicitly. Roughly speaking, this maps the eigenvalues inside the unit 
circle to and those outside to 1. 

2. Compute a rank-revealing decomposition to find the right defiating subspace corresponding to eigen- 
values inside the unit circle. This is spanned by the leading columns of a unitary matrix Qr. 

3. Analogously compute Ql from (/ + {A~^ B^Y^)~^. 

4. Output the block- triangular matrices 



The pencil {An^Bn) has no eigenvalues inside V] the pencil (^22,^22) has no eigenvalues outside V. 
In exact arithmetic and after complete convergence, we should have ^21 = ^21 = 0. In the presence of 
fioating point error with a finite number of iterations, we use ||^2i||i/||^||i and 1 1^21 1 |i/| |^| |i to measure 
the stability of the computation. 

To simplify the main codes, we first write a routine to perform (implicitly) repeated squaring of the 
quantity A~^B. The proofs for correctness are in the next section. 



Algorithm 3 Function IRS, performs implicit repeated squaring of the quantity A ^B for a pair of matrices 

(A 5) 



Let Aq = A and = ^; j = 0. 
repeat 



Bj \ ^ ( Qii Q12 \ ( Rj 
-Aj ) \ Q21 Q22 



4: if \\Rj — Rj < T\\Rj-i\\i , ... convergence! 

then 

5: p = j + 1 , 

6: else 

7: j = j + 1 

8: end if 

9: until convergence or j > maxit. 

10: return Ap, Bp. 



Armed with repeated squaring, we can now perform a step of divide- and-conquer. We start with the 
most general case of a two-matrix pencil. 

The algorithm RGNEP starts with the right deflating subspace: line 1 does implicit repeated squaring 
of A~^B^ followed by a rank-revealing decomposition of {Ap + Bp)~^Ap (line 2). The left deflating subspace 
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Algorithm 4 Function RGNEP, performs a step of divide-and-conquer on a pair of matrices A and B 



1: [Ap,Bp] =mS{A,B). 

2: Qr =GBJJBV {2, Ap + Bp, Ap, -1,1). 

3: [Ap,Bp] =mS{A",B"). 

4: Ql =GmJRV (2, A^, (Ap + Bp)", 1,-1). 

5: 

i := QfAQ„ = ( ) . ,3) 

6 := QiBQ, = ( £ ) ^ (4) 

the dimensions of the subblocks are chosen so as to minimize ||£^2i||i/||^||i + ||^2i||i/||^||i; 
6: return (A, B^Ql, Qr) and the dimensions of the subblocks. 



is handled similarly in lines 3 and 4. Finally, line 5 divides the pencil by choosing the split that minimizes 
the sum of the relative norms of the bottom-left submatrices, e.g. | |£^2i | |i/| |^| |i + ||^2i||i/||^||i- If this 
norm is small enough, convergence is successful. In this case £^21 and F21 may be zeroed out, dividing the 
problem into two smaller ones, given by (An, ^n) and (A22, ^22)- Note that if more than one pair of blocks 
(£^21, ^21) is small enough to zero out, the problem may be divided into more than two smaller ones. 



Algorithm 5 Function RNEP, performs a step of divide-and-conquer on a non-hermitian matrix A; I here 
is the identity matrix. 

1: [Ap.Bp] =IRS(A,/). 

2: Q =GRURV(2, + 

3: 

the dimensions of the subblocks are chosen so as to minimize ||£2i||i/||A-||i; 
4: return (A, Q) and the dimensions of the blocks. 



The algorithm RNEP deals with the non-symmetric eigenvalue problem, that is, when A is non-hermitian 
and B = I. In this case, we skip the computation of the left deflating subspace, since it is the same as the 
right one; also, it is sufficient to consider ||£2i||i/||A||i, since it is the backward error in the computation. 
Note that Q is not needed, as in exact arithmetic B = I. 

The algorithm RSEP is virtually identical to RNEP, with the only difference in the second equation of 
line 3, where we enforce the symmetry of A. We choose to write two separate pieces of code for RSEP and 



RNEP, as the analysis of theses two codes in Section 3.2 differs. 



We choose to explain the routine for computing singular values of a matrix A, rather than present it 
in pseudocode. Instead of calculating the singular values of A directly, we construct the hermitian matrix 

r A ] 

B = j^jj Q and compute its eigenvalues (which are the singular values of A and their negatives) and 

eigenvectors (which are concatenations of left and right singular vectors of A). Thus, computing singular 
values completely reduces to computing eigenvalues. 
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Algorithm 6 Function RSEP, performs a step of divide- and- conquer on a hermitian matrix A; I here is 
the identity matrix. 



1: [Ap,Bp] =IRS(A,/). 

2: [U,Ri,V] =GRURV(2,^p + Bp,^p,-l,l)- 

3: 

A := Q"AQ 
A := 

\ ^21 ^22 J 

the dimensions of the subblocks are chosen so as to minimize ||£^2i||i/||^||i; 
4: return (A, Q) and the dimensions of the blocks. 



3.2 One step: correctness and reliability of implicit repeated squaring 

Assume for simphcity that ah matrices involved are invert ible, and let us examine the basic step of the 
algorithm IRS. It is easy to see that 

Qfi Q| \f Bj Qf.Bj - Q§,Aj \ _ ( R, 



thus Qi2Bj — Q22^3^ and then B^A - ^ = Qi'^Q22] finally, 

= ^J^QifQ22Bj = i^J^^j)'^ ^ 

proving that the algorithm IRS repeatedly squares the eigenvalues, sending those outside the unit circle to 
00, and the ones inside to 0. 

As in the proof of correctness from [3 , we note that 



{Ap + Bp)-^Ap = (/ + A^^Bp)-^ = (/ + {A-'B 



,2P\ 



and that the latter matrix approaches Pr^\z\>i^ the projector onto the right deflating subspace corresponding 
to eigenvalues outside the unit circle. 



3.3 High level strategy for divide-and-conquer, single non-symmetric matrix 
case 

For the rest of this section, we will assume that the matrix B = I and that the matrix A is diagonalizable, 
with eigenvector matrix S and spectral radius p{A). 

The first step will be to find (e.g., using Gershgorin bounds) a radius R > so that we are assured 

that all eigenvalues are in the disk of radius R. Choose now a random angle ^, uniformly from [0,27r], and 
draw a radial line 1Z through the origin, at angle 0. The line 1Z intersects the circle centered at the origin of 
radius R/2 at two points defining a diameter of the circle, A and B. We choose a random point a uniformly 
on the segment AB and construct a line C perpendicular to AB through a. 

The idea behind this randomization procedure is to avoid not just "hitting" the eigenvalues, but also 
"hitting" some appropriate e-pseudospectrum of the matrix, defined below (the right choice for e will be 
discussed later). 

Definition 3.1. (Following [46 .) The e-pseudospectrum of a matrix A is defined as 

Ae(A) = G C I z is an eigenvalue of A-\- E for some E with ||£^||2 < e} 
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We measure progress in two ways: one is when a split occurs, i.e., we separate some part of the pseu- 
dospectrum, and the second one is when a large piece of the disk is determined not to contain any part of 
the pseudospectrum. This explains limiting the choice of £ to only half the diameter of the circle; this way, 
we ensure that as long as we are not "hitting" the e-pseudospectrum, we are making progress. See Figure [l] 




Progress! Progress! No progress... 



Figure 1: The possible outcomes of choosing a splitting line. The shaded shapes represent pseudospectral 
components, the dashed inner circle has radius the dashed line is at the random angle ^, and the solid 
line is C. 



Remark 3.2. It is thus possible that the ultimate answer could be a convex hull of the pseudospectrum; 
this is an alternative to the output that classical (or currently implemented) algorithms yield, which samples 
the pseudospectrum. Since in general pseudospectra can be arbitrarily complicated (see [39 ), the amount of 
information gained may be limited; for example, the genus of the pseudospectrum will not be specified (i.e., 
the output will ignore the presence of "holes", as in the third picture on Figure [T]). However, in the case of 
convex or almost convex pseudospectra, a convex hull may be more informative than sampling. 

Remark 3.3. One could also imagine a method by which one could attempt to detect "holes" in the 
pseudospectrum, e.g., by sampling random points inside of the convex hull obtained and "fitting" disks 
centered at the points. Providing a detailed explanation for this case falls outside the scope of this paper. 

Let us now compute a bound on the probability that the line C does not intersect a particular pseu- 
dospectrum of A (which will depend on the machine precision, norm of A, etc.). To this extent, we will first 
examine the probability that the line C is at (geometrical) distance at least e from the spectrum of A. 

Note that the problem is rotationally invariant; therefore it suffices to consider the case when ^ = 0, and 
simply examine the case when £ is x = a, i.e., our random line is a vertical line passing through the point a 
on the segment [—R/2^R/2]. 

Consider now the n e-balls that C must not intersect, and put a vertical "tube" around each one of 
them (as in Figure [2|. These tubes will intersect the segment [—R^B\ in n (possibly overlapping) small 
segments (of length 2e). In the worst-case scenario, they are all inside the segment [— i?/2, i?/2], with no 
tubes overlapping. If the point a, chosen uniformly, falls into any one of them, the random line C will fall 
within e of the corresponding eigenvalue. This happens with probability at most 2en/R. 

The above argument proves the following lemma: 
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Lemma 3.4. Let dg be the geometrical distance between the (randomly chosen) line C and the spectrum. 
Then 



P[dg >e]> max{l - ^ , 0} 
-ft 



(5) 



-R/2 



a 



I _ I 



I 1^1 
I I I 



'R/2 



Figure 2: Projecting the e tubes onto [—R^R]; the randomly chosen line C (i.e., x = a) does not intersect 
the (thick) projection segments (each of length 2e) with probability at least 1 — The disks are centered 
at eigenvalues. 

We would like now to estimate the number of iterations that the algorithm will need to converge, so we 
will use the results from |3j . To use them, we need to map the problem so that the splitting line becomes the 
unit circle. We will thus revisit the parameters that were essential in understanding the number of iterations 
in that case, and we will see how they change in the new context. 

We start by recalling the notation d(^A,B) foi" the "distance to the nearest ill-posed problem" (in our case, 
"ill-posed" means that there is an eigenvalue on the unit circle) for the matrix pencil (A, .B), defined as 
follows: 



d{A,B) = + 11^11 ' {A-\- E) — z{B + F) is singular for some z where \z\ = 1} ; 

according to Lemma 1, Section 5 from [3 , d(^A,B) = min^i crmin(^ — e^^B). 

We use now the estimate for the number p of iterations given in Theorem 1 of Section 4 from [3 . Provided 
that the number of iterations, p, is larger than 



p > log2 



di 



iA,B) 



^iA,B) 



(6) 



the relative distance between {Ap + Bp) ^Ap and the projector Pr^\z\>i may be bounded from above, as 
follows: 



\\{Ap^Bp)-^Ap-Pn,\,\^,\\ ^ 
^ \\Pr,\z\>i\\ ~~ 



2P+3 



1 



\\iA,B)\\ 



2P 



max fo,l -2^^+2 (l - ^ 



\iA,B)\\ 



2P 



This implies that (for p satisfying (|6|) convergence is quadratic and the rate depends on d(^A,B)/\\{A, B)\\, 
the size of the smallest perturbation that puts an eigenvalue of (A, B) on the unit circle. 

The new context, however, is that of lines instead of circles; so, if we start with (A,/) and we choose a 
random line (w.l.o.g. assume that the line is Re{z) = a) that we then map to the unit circle using the map 
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/(^) = l-a-1 ' pencil (A, /) gets mapped to {A — {a — 1)/, A — (a + 1)/), and the distance to the closest 
ill-defined problem for the latter pencil is <i(A,/) = niin^* crmin((^ — (a — 1)/) — 6*^(^4 — (a + 1)^))- 
Let now 

d/:{A,B) = inf{||£;|| + ||F|| : {A^ E) - z{B + F) is singular for some z where Re{z) = a} 

be the size of the smallest perturbation that puts an eigenvalue of (A, B) on the line C. 
It is a quick computation to show that 

dc{A,B) = mm2\smd)\a,nin{A-aB^^-^B) 
9 2 1 — e*^ 

and from here, redefining as 0/2^ we obtain that 

dc(A,B)=2 min a^inl sm( 0)( A - aB) ^ i cos (0)b) . (7) 

6'G[0,7r] V / 

Let us now define na{A, B) = \\{A -aB^B,A-aB - B)\\. 

Combining equations (|6| and ^ for the case of a pencil (A, /), we obtain that, if p, the number of itera- 
tions, is larger than \og2{na{A^ I) / dc{A^ I) — 1), then the relative distance between the computed projector 
and the actual one decreases quadratically. This means that, if we want a relative error err^ < e, we will 
need \og2{na{A,I) / dc{A,I) - 1) + log2 logs (7) iterations. 

How do dc and dg relate? In general, the dependence is complicated (see [39^), but a simple relationship 
is given by the following result by Bauer and Fike, which can be found for example as Theorem 2.3 from 
[46]: dc{A,I) > where we recall that S is the eigenvector matrix. We obtain the following lemma. 

Lemma 3.5. Provided that e is suitably small, then with probability at least 1 — the number of iterations 
to make err^ < e is at most 

p = log2(na(A, I)K{S)/e - 1) + log2 log2 

Finally, we have to consider e. In practice, every non-trivial stable dense matrix algorithm has a backward 
error bounded by e^/(n)| l^l |, where f{n) is a polynomial of small degree and is the machine precision. 
This is the minimum sort of distance that we can expect to be from the pseudospectrum. Thus, we must 
consider e = c • • /(n) • |A| |. 

Remark 3.6. Naturally, this makes sense only if e < 1. 

Recall that here R is the bound on the spectral radius (perhaps obtained via Gershgorin bounds). In 
practice R = 0(||74||); assuming that we have prescaled A so that = 0(1), it follows that a, as well as 
na{A^I) are of the same order. The only question may be, what happens if we "make progress" without 
actually splitting eigenvalues, e.g., what if is much smaller than R at first? The answer is that we 
decrease R quickly, and here is a 2-step way to see it. 

Suppose that we have localized all eigenvalues in the circular segment made by a chord C at distance at 
most R/2 from the center (see Figure [s] below). We will consider the worst case, that is, when we split no 
eigenvalues, but "chop off" the smallest area. We will proceed as follows: 

1. The next random line C will be picked to be perpendicular on C, at distance at most R/2 from the 
center. All bounds on probabilities and number of iterations computed above will still apply. 

2. If, at the next step, we once again split off a large part of the circle with no eigenvalues, the remaining 
area fits into a circle of radius at most sin(57r/12)i?; if we wish to simplify the divide-and-conquer 
procedure, we can consider the slightly larger circle, recenter the complex plane, and continue. Note 
that at the most, the new radius is only a fraction of sin(57r/12) ^ .965 of the old one. 
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Step 1 : the filled-in region 
contains no eigenvalues 




Step 2: the filled-in region 
contains no eigenvalues 



The remaining region fits 
into the circle of diameter AB 



Figure 3: The 2-step strategy reduces the radius of the bounding box. 




3. If, at the second step, we split the matrix, we can simply examine the Gershgorin bounds for the new 
matrices and continue the recursion. 

Thus, we are justified in assuming that R = 0(||A||); to sum up, we have the following corollary: 

Corollary 3.7. With probability at least 1 — c • e • n • f{n) ■ k,{S), the number of iterations needed for progress 
at every step is at most log2 cef{n) ^ -^^^2 e • 

3.3.1 Numerical experiments 

Numerical experiments for Algorithm |5] are given in Figures |4] and [5j We chose four representative examples 
and present for each the distribution of the eigenvalues in the complex plane and the convergence plot of the 
backward error after a given number of iterations of implicit repeated squaring. This backward error is ^^i^y^^ 
in the notation given in Algorithm [5j where the dimension of £^21 is chosen to minimize this quantity. In 
practice, a different test of convergence can be used for the implicit repeated squaring, but for demonstrative 
purposes, we computed the invariant subspace after each iteration of repeated squaring in order to show the 
convergence of the backward error of an entire step of divide- and-conquer. 

The examples presented in Figure [4] are normal matrices generated by choosing random eigenvalues (where 
the real and imaginary parts are each chosen uniformly from [—1.5, 1.5]) and applying a random orthogonal 
similarity transformation. The only difference is the matrix size. We point out that the convergence rate 
depends not on the size of the matrix but rather the distance between the splitting line (the imaginary axis) 
and the closest eigenvalue (as explained in the previous section). This geometric distance is the same as 
the distance to the nearest ill-posed problem because in this case, the condition number of the (orthogonal) 
diagonalizing similarity matrix is 1. The distance between the imaginary axis and the nearest eigenvalue is 
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8.36 • lO"'^ in Figure 4(a; 



(n = 100) and 1.04 • 10~^ in Figure 4(b) (n = 1000); note that convergence occurs 
at about the same iteration. 

The examples presented in Figure [5] are meant to test the boundaries of convergence of Algorithmjsj While 
the eigenvalues in these examples are artificially chosen to be close to the splitting line (imaginary axis), 
our randomized higher level strategy for choosing splitting lines will ensure that such cases will be highly 
unlikely. Figure 5(a) presents an example where the eigenvalues are chosen with random imaginary part 
(between —1.5 and 1.5) and real part set to ±10~^^ and the eigenvectors form a random orthogonal matrix. 
Because all the eigenvalues lie so close to the splitting line, convergence is much slower than in the examples 
with random eigenvalues. Note that because the imaginary axis does not intersect the e-pseudospectrum 
(where e is machine precision), convergence is still achieved within 40 iterations. 

The example presented in Figure 5(b) is a 32-by-32 non-normal matrix which has half of its eigenvalues 
chosen randomly with negative real part and the other half set to 0.1, forming a single Jordan block. In this 
example, the imaginary axis intersects the e-pseudospectrum and so Algorithm [5] does not converge. Note 
that the eigenvalue distribution in the plot on the left shows 16 eigenvalues forming a circle of radius 0.1 
centered at 0.1. The eigenvalue plots are the result of using standard algorithms which return 16 distinct 
eigenvalues within the pseudospectral component containing 0.1. 



Eigenvalues, n=100 Convergence Plot 




(a) 100 X 100 matrix with random eigenvalues 



Eigenvalues, n=1000 Convergence Plot 




-1 1 5 10 15 20 25 



(b) 1000 X 1000 matrix with random eigenvalues 

Figure 4: Numerical experiments using normal matrices for Algorithm RNEP. On the left of each subfigure 
is a plot of the eigenvalues in the complex plane (as computed by standard algorithms) with the imaginary 
axis (where the split occurs) highlighted. In each case, the eigenvalues have real and imaginary parts 
chosen uniformly from [—1.5, 1.5]. On the right is a convergence plot of the backward error as described in 
Section 13.3.11 
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Eigenvalues, n=25 



Convergence Plot 
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(a) 25 X 25 matrix with eigenvalues whose real parts are ±10 
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Eigenvalues, n=32 



Convergence Plot 
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(b) 32 X 32 matrix with half the eigenvalues forming Jordan block at 0.1 



Figure 5: Numerical experiments for Algorithm RNEP. On the left of each subfigure is a plot of the 
eigenvalues in the complex plane ( as co mputed by standard algorithms) with the imaginary axis (where the 
split occurs) highlighted. Note in 5(a), all eigenvalues are at a distance of 10~^^ from the imaginary axis. 
In [5(b)] the imaginary axis intersects the e-pseudospectrum (where e is machine precision). On the right is 
a convergence plot of the backward error as described in Section [3.3.1 [ for up to 60 iterations. 



3.4 High level strategy for divide-and-conquer, single symmetric matrix (or 
SVD) case 

In this case, the eigenvalues are on the real line. After obtaining upper and lower bounds for the spectrum 
(e.g., Gershgorin intervals), we use vertical lines to split it. In order to try and split a large piece of the 
spectrum at a time, we will choose the lines to be close to the centers of the obtained intervals. 

From Lemma 3 of [3] we can see that the basic splitting algorithm will converge if the splitting line does 
not intersect some e-pseudospectrum of the matrix, which consists of a union of at most n 2e-length intervals. 
Note that here we only consider symmetric perturbations to the matrix, and thus the pseudospectrum is a 
union of intervals. 

To maximize the chances of a good split we will randomize the choice of splitting line, as follows: for 
an interval /, which can without loss of generality be taken as [— i?, i?], we will choose a point x uniformly 
between [—R/2, R/2] and use the vertical line passing through x for our split. See Figure [6] below. 

As before, we measure progress in two ways: one is when a successful split occurs, i.e., we separate some 
of the eigenvalues, and the second one is when no split occurs, but a large piece of the interval is determined 
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-R -R/2 , X R/2 R 



-R -R/2 



R/2 R 



-R -R/2 , X R/2 R 



Progress! 



Progress! 



No progress. 



Figure 6: The possible outcomes of splitting the segment [—R^R]. The thick dark segments represent 
pseudospectral components (unions of overlapping smaller segments), while the dashed line represents the 
random choice of split. 



not to contain any eigenvalues. Thus, progress can fail to occur in this context only if our choice of x falls 
inside one (or more) of the small intervals representing the e-pseudospectrum. The total length of these 
intervals is at most 2en. Therefore, it is a simple calculation to show that the probability that a randomly 
chosen line in [—R/2^ R/2] will intersect the e-pseudospectrum is larger than 1 — 2ne/R. 

As before, the right choice for e must be proportional to c/(n) ||A||, with /(n) a polynomial of low degree; 
also as before, R will be 0(||A||); putting these two together with Lemma 3.5, we get the following. 



Corollary 3.8. With probability at least 1 — c ■ e ■ n - f{n), the number of iterations is at most log2 cef{n) ^~ 
log2 log2 \. 

Finally, we note that a large cluster of eigenvalues actually helps convergence in the symmetric case, as 
the following lemma shows: 



Lemma 3.9. If A 



All Al^ 

^21 ^22 



is hermitian, IIA21II2 < (Amax(^) - \min{A))/2. 

{A) — Amin(^) is tiny) then the matrix must 



In other words if the eigenvalues are tightly clustered (A 
be nearly diagonal. 



Proof. Choose unit vectors x and y such that y^ A21X = \\A21 II2, and let z± = 2 



X 



. Then zlAz^ 



.5x^Aiix + .5?/^A22^± 11^21 lb- So by the Courant-Fischer min-max theorem, Amax(^) — Amin(^) > z^Azj^- 



_Az. 



2||^2l| 



□ 



3.5 High level strategy for divide-and-conquer, general pencil case 

Now we consider the case of a general square pencil defined by (A, B\ and how to reduce it to (block) upper 
triangular form by dividing- and-conquering the spectrum in an analogous way to the previous sections. Of 
course, in this generality the problem is not necessarily well-posed, for if the pencil is singular, i.e. the 
characteristic polynomial det(A — XB) = independent of A, then every number in the extended complex 
plane can be considered an eigenvalue. Furthermore, arbitrarily small perturbations to A and B can make 
the pencil regular (nonsingular) with at least one (and sometimes all) eigenvalues located at any desired 
point (s) in the extended complex plane. (For example, suppose A and B are both strictly upper triangular, 
so we can change their diagonals to be arbitrarily tiny values with whatever ratios we like.) 

The mathematically correct approach in this case is to compute the Kronecker Canonical Form (or rather 
the variant that only uses orthogonal transformations to maintain numerical stability [13 [18]), a problem 
that is beyond the scope of this paper. 

The challenge of singular pencils is shared by any algorithm for the generalized eigenproblem. The 
conventional QZ iteration may converge, but tiny changes in the inputs (or in the level or direction of 
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roundoff) will generally lead to utterly different answers. So "successful convergence" does not mean the 
answer can necessarily be used without further analysis. In contrast, our divide- and-conquer approach will 
likely simply fail to divide the spectrum when the pencil is close to a singular pencil. In the language of 
section [331 ^ singular pencil the distance to any splitting line or circle is zero, and the e-pseudospectrum 
includes the entire complex plane for any e > 0. When such repeated failures occur, the algorithm should 
simply stop and alert the user of the likelihood of near-singularity. 

Fortunately, the set of singular pencils form an algebraic variety of high codimension [15], i.e. a set of 
measure zero, so it is unlikely that a randomly chosen pencil will be close to singular. Henceforth we assume 
that we are not near a singular pencil. 

Even in the case of a regular pencil, we can still have eigenvalues that are arbitrarily large, or even 
equal to infinity, if B is singular. So we cannot expect to find a single finite bounding circle holding all the 
eigenvalues, for example using Gershgorin disks [43 , to which we could then apply the techniques of the last 
section, because many or all of the eigenvalues could lie outside any finite circle. Indeed, a Gershgorin disk 
in |43] has infinite radius as long as there is a common row of A and B that is not diagonally dominant. 

In order to exploit the divide-and-conquer algorithm for the interior of circles developed in the last 
section, we proceed as follows: We begin by trying to divide the spectrum along the unit circle in the 
complex plane. If all the eigenvalues are inside, we proceed with circles of radius , i.e. radius ^ and 
then repeatedly squaring, until we find a smallest such bounding circle, or (in a finite number of steps) run 
out of exponent bits in the floating point format (which means all the eigenvalues are in a circle of smallest 
possible representable radius, and we are done). 

Conversely, if all the eigenvalues are outside, we proceed with circles of radius 2^ , i.e. radius 2 and 
repeatedly squaring, again until we either find a largest such bounding circle (all eigenvalues being outside), 
or we run out of exponent bits. At this point, we swap A and 5, taking the reciprocals of all the eigenvalues, 
so that they are inside a (small) circle, and proceed as in the last section. If the spectrum splits along a 
circle of radius r, with some inside (say those of An — XBn) and some outside (those of A22 — A522), we 
instead compute eigenvalues of B22 — XA22 inside a circle of radius K 

3.6 Related work 

As mentioned in the introduction, there have been many previous approaches to parallelizing eigenvalue/singular 
value computations, and here we compare and contrast the three closest such approaches to our own. These 
three algorithms are the PRISM project algorithm [6 , the reduction of symmetric EIG to matrix multiplica- 
tion by Yau and Lu [35] , and the matrix sign function algorithm originally formulated in [26l [T2[ |36| |37l |38] . 

At the base of each of the approaches is the idea of computing, explicitly or implicitly, a matrix function 
f{A) that maps the eigenvalues of the matrix (roughly) into and 1, or and 00. 

3.6.1 The Yau-Lu algorithm 

This algorithm essentially splits the spectrum in the symmetric case (as explained below), but instead of 
divide-and-conquer, which only aims at reducing the problem into two subproblems of comparable size, the 
Yau-Lu algorithm aims to break the problem into a large number of small subproblems, which can then be 
solved by classical methods. 

The starting idea is put all the eigenvalues on the unit circle by computing B = e*^; next, they use a 
Chebyshev polynomial Pn{z) of high degree (TV ^ 8n) to evaluate u{\) = PN{e~'^^B)vo for A = ^ for 
/c = 0, . . . , — 1. If A is close to an eigenvalue, u{X) will be close to the eigenvector {Pn{z) has a peak at 
2; = 1, and is very close to everywhere except a small neighborhood of z = 1 on the unit circle). They use 
the FFT to compute u{X) simultaneously for all A, which makes things a lot faster (0(n^ log2n) instead of 
0(n^)). This works only under the strong assumption that by scaling the eigenvalues so as to be between 
and 27r, the gaps between them are all of order about 0(l/n). 

Finally, they extract the most likely candidates for eigenvectors from among the ii(A), split them into 
p orthogonal clusters, add more vectors if necessary, construct an orthogonal basis W whose elements span 
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invariant subspaces of A, and thus obtain that W^AW decouples the spectrum, making it necessary and 
sufficient to solve p small eigenvalue problems. 

The authors make an attempt to perform an error analysis as well as a complexity analysis; neither 
completely addresses the case of tightly clustered eigenvalues. Clustering is a well-known challenge for 
attaining orthogonality with reasonable complexity, as highlighted in [TO*, "20^. 

By contrast, in the symmetric case, for highly clustered eigenvalues, our strategy would produce small 
bounding intervals and output them together with the number of eigenvalues to be found in them, as 
explained in the previous section. 

3.6.2 Divide-and-conquer: the PRISM (symmetric) and sign function (both symmetric and 
non-symmetric) approaches 

The algorithms [6l [26l [121 EH] can be described as follows. Given a matrix A: 

Step 1. Compute, explicitly or implicitly, a matrix function f{A) that maps the eigenvalues of the matrix 
(roughly) into and 1, or and oo. 

Step 2. Do a rank-revealing decomposition (e.g. QR) to find the two spaces corresponding to the eigenvalues 
mapped to 1 and 0, or oo and 0, respectively; 

Step 3. Block-upper-triangularize the matrix to split the problem into two or more smaller problems; 

Step 4. Recur on the diagonal blocks until all eigenvalues are found. 

The strategy is thus to divide-and-conquer; issues of concern have to deal with the stability of the 
algorithm, the type and cost of the rank-revealing decomposition performed, the convergence of the first 
step, as well as the splitting strategy. We will try to address some of these issues below. 

In the PRISM approach (for the symmetric eigenvalue, and hence by extension the SVD) [6 , the authors 
choose Beta functions (polynomials which roughly approximate the step function from [0, 1] into {0, 1}, with 
the step taking place at 1/2) with or without acceleration techniques for their iteration. No stability analysis 
is performed, although there is reason to believe that some sort of forward stability could be achieved; 
the rank-revealing decomposition used is QR with pivoting. As such, there is also reason to believe that 
the approach is communication-avoiding, and that optimality can be reached by employing a lower-bound- 
achieving QR. 

Since f{A) is computed explicitly, one could potentially use a deterministic rank-revealing QR algorithm 
instead of our randomized one. There has been recent progress in developing such a QR algorithm that also 
minimizes communication [TT]. 

The other approach is to use the sign function algorithm introduced by [40]; this works for both the 
symmetric and the non- symmetric cases, but requires taking inverses, which leads to a potential loss of 
stability. This can, however, be compensated for by ad- hoc means, e.g., by choosing a slightly different 
location to split the spectrum. In either case, some assumptions must be made in order to argue stability. 
A more in-depth discussion of algorithms using this method can be found in Section 3 of [3]. 

3.6.3 Our algorithm 

Following in the footsteps of [3 and [14 , we have used the basic algorithm of [26| [l2l IMl [371 [38l. This 
algorithm was modified to be numerically stable and otherwise more practical (avoiding matrix exponentials) 
in 0. It was then reduced to matrix multiplication and dense QR in [14 by introducing randomization in 
order to achieve the same arithmetic complexity as fast matrix multiplication (e.g. 0(n^"^^) using Strassen's 
matrix multiplication). 

The algorithms we present here began as the algorithms in [TT and were modified to take advantage of the 
optimal communication-complexity O(n^) algorithms for matrix multiplication and dense QR decomposition 
(the latter of which was given in [16 ). 
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We also make here several contributions to the numerical analysis of the methods themselves. First, they 
depend on a rank-revealing generalized QR decomposition of the product of two matrices A~^B^ which is 
numerically stable and works with high probability. By contrast, this is done with pivoting in [3 , and the 
pivot order depends only on 5, whereas the correct choice obviously depends on A as well. We know of 
no way to stably compute a rank-revealing generalized QR decomposition of expressions like A~^B without 
randomization. More generally, we show how to compute a randomized rank revealing QR decomposition of 
an arbitrary product of matrices (and/or inverses); some of the ideas were used in a deterministic context 
in G.W. Stewart's paper [45 , with application to graded matrices. 

Second, we elaborate the divide-and-conquer strategy in considerably more detail than previous publica- 
tions; the symmetric/SVD and non-symmetric cases differ significantly. The symmetric/SVD case returns 
a list of eigenvalues/eigenvectors (singular values/vectors) as usual; in the nonsymmetric case, the ultimate 
output may not be a matrix in Schur form, but rather block Schur form along with a bound (most simply 
a convex hull) on the e-pseudospectrum of each diagonal block, along with an indication that further re- 
finement of the pseudospectrum is impossible without higher precision. For example, a single Jordan block 
would be represented (roughly) by a circle centered at the eigenvalue. A conventional algorithm would of 
course return n eigenvalues evenly spaced along the circular border of the pseudospectrum, with no indication 
(without further computation) that this is the case. Both basic methods are oblivious to the genus of the 
pseudospectrum, as they will fail to point out "holes" like the one present in the third picture of Figure [l] In 
certain cases (like convex or almost convex pseudospectra) the information returned by divide-and-conquer 
may be more useful than the conventional method. 

We emphasize that if a subproblem is small enough to solve by the standard algorithm using minimal 
communication, a list of its eigenvalues will be returned. So we will only be forced to return bounds 
on pseudospectra when the dimension of the subproblem is very large, e.g. does not fit in cache in the 
sequential case. 



4 Communication Bounds for Computing the Schur Form 



4.1 Lower bounds 



We now provide two different approaches to proving communication lower bounds for computing the Schur 
decomposition. Although neither approach yields a completely satisfactory algorithm-independent proof of 
a lower bound, we discuss the merits and drawbacks as well as the obstacles associated with each approach. 

The goal of the first approach is to show that computing the QR decomposition can be reduced to 
computing the Schur form of a matrix of nearly the same size. Such a reduction in principle provides a way 
to extend known lower bounds (communication or arithmetic) for QR decomposition to Schur decomposition, 
but has limitations described below. 

We outline below a reduction of QR to a Schur decomposition followed by a triangular solve with multiple 
right hand sides (TRSM). There are two main obstacles to the success of this approach. First, the TRSM 
may have a nontrivial cost so that proving the cost of QR is at most the cost of computing Schur plus 
the cost of TRSM is not helpful. Second, all known proofs of lower bounds for QR apply to a restricted 
set of algorithms which satisfy certain assumptions, and thus the reduction is only applicable to Schur 
decomposition and TRSM algorithms which also satisfy those assumptions. 

The first obstacle can be overcome with additional assumptions which are made explicit below. We do 
not see a way to overcome the second obstacle with the assumptions made in the current known lower bound 
proofs for QR decomposition, but future, more general proofs may be more amenable to the reduction. 
We now give the reduction of QR decomposition to Schur decomposition and TRSM. Let R be m-hy-m 

R 
X 



and upper triangular, X be n-by-m and dense, A 
(m + n)-by-(m + n). Let 
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where R and Qii are both m-by-m, and Q22 is n-hy-n. Note that Qii and R are both upper triangular. 
Then it is easy to confirm that the Schur decomposition of B is given by 

Til T12 

Qnxm Qnxn 

Since Tn is the upper triangular product of R and Qii, we can easily solve for R = Tn • Qii. This leads 
to our reduction algorithm S2QR of the QR decomposition of A to Schur decomposition of B outline below. 
Note that here Schur stands for any black-box algorithm for computing the respective decomposition. In 
addition, TRSM stands for "triangular solve with multiple right-hand-sides", which costs O(m^) flops, and 
for which communication-avoiding implementations are known [4 . 



-B 



R ' Qii R ' Q12 

Qnxm Qnxn 



Algorithm 7 Function [Q, R] =S2QR([i?; X]), where R is n x n upper triangular and X is m x n 



1: Form B = ^ ^ 
A (J 

2: Compute [g,T] =Schur(5) 

3: Qii = (5(1 : m, 1 : m); Tu = T(l : m, 1 : m) 

4: ^=TRSM[Qii,Tii] 

5: return [Q, R] 



This lets us conclude that 

Cost((m + n) X (m + n) Schur) > Cost((m + n) x (m) S2QR) - Cost((m) x (m) TRSM) 

We assume now that we are using an "O(n^) style" algorithm, which means that the cost of S2QR is 
Vt{m?n)^ say at least cim?n for some constant ci. Similarly, the cost of TRSM is O(m^), say at most C2m^. 
We choose m to be a suitable fraction of n, say m = min(.5ci/c2, l)n, so that 

Cost((2n) X (2n) Schur) > .5cim^n = Q{n^) 

as desired. 

We note that this proof can provide several kinds of lower bounds: 

(1) It provides a lower bound on arithmetic operations, by choosing ci to bound the number of arithmetic 
operations in S2QR from below [16 , and C2 = 1/3 to bound the number of arithmetic operations in 
TRSM from above [4]. 

(2) It can potentially provide a lower bound on the number of words moved (or the number of messages) 
in the sequential case, by taking ci and C2 both suitably proportional to 1/M^/^ (or to 1/M^/^). We 
note that the communication lower bound proof in [4] does not immediately apply to this algorithm, 
because it assumed that the QR decomposition was computed (in the conventional backward- stable 
fashion) by repeated pre-multiplications of A by orthogonal transformation, or by Gram- Schmidt, or 
by Cholesky-QR. However, this does not invalidate the above reduction. 

(3) It can similarly provide a lower bound for all communication in the parallel case, assuming that the 
arithmetic work of both S2QR and TRSM are load balanced, by taking ci and C2 both proportional 
to 1/P. 

The second approach to proving a communication lower bound for Schur decomposition is to apply the 
lower bound result for applying orthogonal transformations given by Theorem 4.2 in [4 . We must assume 
that the algorithm for computing Schur form performs O(n^) flops which satisfy the assumptions of this 
result, to obtain the desired communication lower bound. This approach is a straightforward application of 
a previous result, but the main drawback is the loss of generality. Fortunately, this lower bound does apply 
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to the spectral divide- and- conquer algorithm (by applying it just to the QR decompositions performed) as 
well as the first phase of the successive band reduction approach discussed in Section [6j this is all we need 
for a lower bound on the entire algorithm. But it does not necessarily apply to all future algorithms for 
Schur form. 

The results of Theorem 4.2 in [4 require that a sequence of Householder transformations are applied to 
a matrix (perhaps in a blocked fashion) where the Householder vectors are computed to annihilate entries 
of a (possibly different) matrix such that forward progress is maintained, that is, no entry which has been 
previously annihilated is filled in by a later transformation. Therefore, this lower bound does not apply 
to "bulge-chasing," the second phase of successive band reduction, where a banded matrix is reduced to 
tridiagonal form. But it does apply to the reduction of a full symmetric matrix to banded form via QR 
decomposition and two-sided updates, the first phase of the successive band reduction algorithms in Section[6j 

Reducing a full matrix to banded form (of bandwidth b) requires F = + bn'^ — |6^n + ^b^ flops, so 

assuming forward progress, the number of words moved is at least ft I -y= ) , so for b at most some fraction 



of n, we obtain a communication lower bound of Vt 




4.2 Upper bounds: sequential case 

In this section we analyze the complexity of the communication and computation of the algorithms described 
in Section [3] when performed on a sequential machine. We define M to be the size of the fast memory, a 
and /3 as the message latency and inverse bandwidth between fast and slow memory, and 7 as the flop rate 
for the machine. We assume the matrix is stored in contiguous blocks of optimal blocksize 6x6, where 
b = 0(a/M). We will ignore lower order terms in this analysis, and we assume the original problem is too 
large to fit into fast memory (i.e., ar? > M for a = 1, 2 or 4). 

4.2.1 Subroutines 

The usual blocked matrix multiplication algorithm for multiplying square matrices of size n x n has a total 
cost of 

CMM(n) = a.o(^)+/^.o(-^)+7-0(n3). 

From |16], the total cost of computing the QR decomposition using the so-called Communication- Avoiding 
QR algorithm (CAQR) of an m x n matrix (m > n) is 

CQR{m,n) = a-Oi^-^^ + /3 • O (^-^ j +^-0{mv?) . 

This algorithm overwrites the upper half of the input matrix with the upper triangular factor but does not 
return the unitary factor Q in matrix form (it is stored implicitly). To see that we may compute Q explicitly 
with only a constant factor more communication and work, we could perform CAQR on the following block 
matrix: 



A 


/" 




Q 


0" 




R 















/ 











In this case, the block matrix is overwritten by an upper triangular matrix that contains R and the 
conjugate-transpose of Q. For A of size n x this method incurs a cost of CQR{2n^2n) = OiCQRin^n)). 
(This is not the most practical way to compute Q, but is sufficient for a O(-) analysis.) We note that an 
analogous algorithm (with identical cost) can be used to obtain RQ, QL, and LQ factorizations. 

Computing the randomized rank-revealing QR decomposition (Algorithm [T]) consists of generating a 
random matrix, performing two QR factorizations (with explicit unitary factors), and performing one matrix 
multiplication. Since the generation of a random matrix requires no more than O(n^) computation or 
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communication, the total cost of the RURV algorithm is 



CnuRvin) = 2 • CQR{n, n) + CMM{n) 

.3 \ / ^3 



We note that this algorithm returns the unitary factor U and random orthogonal factor V in explicit form. 
The analogous algorithm used to obtain the RULV decomposition has identical cost. 

Performing implicit repeated squaring (Algorithm [3| consists of two matrix additions followed by iterating 
a QR decomposition of a 2n x n matrix followed by two n x n matrix multiplications until convergence. 
Checking convergence requires computing norms of two matrices, constituting lower order terms of both 
computation and communication. Since the matrix additions also require lower order terms and assuming 



convergence after some constant p iterations (depending on e, the accuracy desired, as per Corollaries 3.7 



and 3.8), the total cost of the IRS algorithm is 

CiRs{n) = p-{CQR{2n,n)^2-CMM{n)) 

Choosing the dimensions of the subblocks of a matrix in order to minimize the 1-norm of the lower 
left block (as in computing the n — 1 possible values of ||£^2i ||i/||^||i + ||^2i ||i/||^||i in the last step of 
Algorithm [4| can be done with a blocked column-wise prefix sum followed by a blocked row-wise max 
reduction (the Frobenius and max-row-sum norms can be computed with similar two-phase reductions). 

This requires O(n^) work, O(n^) bandwidth, and O (^j^^ latency, as well as O(n^) extra space. Thus, the 

communication costs are lower order terms when an^ > M. 



4.2.2 Randomized Generalized Non-symmetric Eigenvalue Problem (RGNEP) 

We consider the most general divide-and-conquer algorithm; the analysis of the other three algorithms differ 
only by constants. One step of RGNEP (Algorithm |4| consists of 2 evaluations of IRS, 2 evaluations of 
RURV, 2 evaluations of QR decomposition, and 6 matrix multiplications, as well as some lower order work. 
Thus, the cost of one step of RGNEP (not including the cost of subproblems) is given by 

Crgnep* {n) = 2 • Cmsin) + 2 • CRURvin) + 2 • CQR{n, n) + 6 • CMM{n) 

Here and throughout we denote by Calg* the cost of a single divide-and-conquer step in executing ALG. 

Assuming we split the spectrum by some fraction /, one step of RGNEP creates two subproblems of size 
fn and (1 — f)n. If we continue to split by fractions in the range [1 — /o, /o] at each step, for some threshold 
^ < /o < 1, the total cost of the RGNEP algorithm is bounded by the recurrence 

rf \ ^ (/on) + C ((1 - fo)n) + Crgnep* (n) if 2n^ > M 
^^^^-\ a.O(l)+/3-0(n2)+7.0(n3) if 2n^ < M 

where the base case arises when input matrices A and B fit into fast memory and the only communication 
is to read the inputs and write the outputs. This recurrence has solution 

Note that while the cost of the entire problem is asymptotically the same as the cost of the first divide- 
and-conquer step, the constant factor for C{n) is bounded above by times the constant factor 
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for CRGNEP*{n). That is, for constant A defined such that CRGNEP*{n) < An^^ we can verify that 
C(n) < Of s \ Ti^ satisfies the recurrence: 



C(n) < C{fon) + C{{l- U)n)+CRGNEP*{n) 
Ul + (1 - k?)An^ + (3/o(l - k))An^ 



< 



< —:r-. r^n^. 



3/0(1-/0) 

3/0(1-/0)' 



4.3 Upper bounds: parallel case 

In this section we analyze the complexity of the communication and computation of the algorithms described 
in Section [3] when performed on a parallel machine. We define P to be the number of processors, a and P as 
the message latency and inverse bandwidth between any pair of processors, and 7 as the flop rate for each 
processor. We assume the matrices are distributed in a 2D blocked layout with square blocksize h — As 
before, we will ignore lower order terms in this analysis. After analysis of one step of the divide-and-conquer 
algorithms, we will describe how the subproblems can be solved independently. 



4.3.1 Subroutines 

The SUMMA matrix multiplication algorithm [25 for multiplying square matrices of size n x n, with 
blocksize 6, has a total cost of 



CuMin, P) = a • O log P) + /3 • O log Pj + 7 • O 



From [16 (equation (12) on page 62), the total cost of computing the QR decomposition using CAQR 
of an n X n matrix, with blocksize 6, is 

/ 77/ \ / 77^ \ / 77 Ti^h 

CQfl(n, n,P) = a-0 (-log Pj+/3-0f-^ log Pj + ^ . O ( - + log P 

The cost of computing the QR decomposition of a tall skinny matrix of size 277 x 77 is asymptotically the 
same. 

As in the sequential case, RURV and IRS consist of a constant number of subroutine calls to the QR 
and matrix multiplication algorithms, and thus they have the following asymptotic complexity (here we 
assume a blocked layout, where h = ^)- 

CRURv{n, P) = a • O (Vp log P) + /3 • O log P^ + 7 • O (^^ log P 



and 



CiRs(n.P) = a O (v^logP) + /3 • O logP^ + 7 • O (j^ logp) . 



Choosing the dimensions of the subblocks of a matrix in order to minimize the 1-norm of the lower 
left block (as in computing the 77 — 1 possible values of ||^2i||i/||^||i + ||P2i||i/||P||i in the last step of 
Algorithm |4| can be done with a parallel prefix sum along processor columns followed by a parallel prefix 
max reduction along processor rows (the Frobenius and max-row-sum norms can be computed with similar 

two-phase reductions). This requires O (logP) messages, O logP^ words, and O logP^ arithmetic, 
which are all lower order terms, as well as O extra space. 
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1- — 1 



Figure 7: Assignment of processors to subproblems. The solid lines represent the blocked layout on 25 
processors, the dotted line represents the subproblem split, and processors are color-coded for the subproblem 
assignment. One idle (white) processor is assigned to the smaller subproblem and receives the corresponding 
data from the processor owning the split. 

4.3.2 Randomized Generalized Non-symmetric Eigenvalue Problem (RGNEP) 

Again, we consider the most general divide-and-conquer algorithm; the analysis of the other three algorithms 
differ only by constants. One step of RGNEP (Algorithm [4| requires a constant number of the above 
subroutines, and thus the cost of one step of RGNEP (not including the cost of subproblems) is given by 

Crgnep* (n, P) = a • O (Vp log p) + /3 • O log + 7 • O (^^ log 

Assuming we split the spectrum by some fraction /, one step of RGNEP creates two subproblems of 
sizes fn and (1 — f)n. Since the problems are independent, we can assign one subset of the processors to 
one subproblem and another subset of processors to the second subproblem. For simplicity of analysis, we 
will assign /^P processors to the subproblem of size fn and (1 — /)^P processors to the subproblem of size 
(1 — f)n. This assignment wastes 2/(1 — /)P processors (^ of the processors if we split evenly), but this will 
only affect our upper bound by a constant factor. 

Because of the blocked layout, we assign processors to subproblems based on the data that resides in their 
local memories. Figure [7| shows this assignment of processors to subproblems where the split is shown as a 
dotted line. Because the processors colored light gray already own the submatrices associated with the larger 
subproblem, they can begin computation without any data movement. The processor owning the block of 
the matrix where the split occurs is assigned to the larger subproblem and passes the data associated with 
the smaller subproblem to an idle processor. This can be done in one message and is the only communication 
necessary to work on the subproblems independently. 

If we continue to split by fractions in the range [1 — /o, /o] at each step, for some threshold | < /o < 1, we 
can find an upper bound of the total cost along the critical path by accumulating the cost of only the larger 
subproblem at each step (the smaller problem is solved in parallel and incurs less cost). Although more 
processors are assigned to the larger subproblem, the approximate identity (ignoring logarithmic factors) 

CRGNEP^ifn, fP) ^ / • CRGNEP^{n, P) 

implies that the divide-and-conquer step of the larger subproblem will require more time than the corre- 
sponding step for the smaller subproblem. If we assume that the larger subproblem is split by the largest 
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fraction /o at each step, then the total cost of each smaher subproblem (including subsequent splits) will 
never exceed that of its corresponding larger subproblem. Thus, an upper bound on the total cost of the 
RGNEP algorithm is given by the recurrence 

r(„ P\-i C (/0«' fo^P) + C RGNEP' (n, P) if P > 1 

^i^'^^- I ^.0(n3) ifP=l 

where the base case arises when one processor is assigned to a subproblem (requiring local computation but 
no communication). The solution to this linear recurrence is 

C(n, P) = a • O logP) + /? • O logp) + 7 • O (^^ + ^ logp) . 

Here the constant factor for the upper bound on the total cost of the algorithm is ^~-['_f^ times larger than 
the constant factor for the cost of one step of divide-and-conquer. This constant arises from the summation 
X^iLo /o* where d = logj^ is the depth of the recurrence (we ignore the decrease in the size of the 
logarithmic factors). 

Choosing the blocksize to be 6 = ^=7^ — - , we obtain a total cost for the RGNEP algorithm of 

" \/P log P ^ 

C(n, P) = a-0 (a/P log2 p) + /? • O log P^ + 7 ■ O (^^^ . 

We will argue in the next section that this communication complexity is within polylogarithmic factors of 
optimal. 

5 Computing Eigenvectors of a Triangular Matrix 

After obtaining the Schur form of a nonsymmetric matrix A = QTQ* ^ finding the eigenvectors of A requires 
computing the eigenvectors of the triangular matrix T and applying the unitary transformation Q. Assuming 
all the eigenvalues are distinct, we can solve the equation TX = XD for the upper triangular eigenvector 
matrix X, where I) is a diagonal matrix whose entries are the diagonal of T. This implies that for i < j, 

i-i 



kj 

X,, = (8) 

-'-33 



where Xjj can be arbitrarily chosen for each j. We will follow the LAPACK naming scheme and refer to 
algorithms that compute the eigenvectors of triangular matrices as TREVC. 

5.1 Lower bounds 

Equation [s] matches the form specified in [4 , where the gijk functions are given by the multiplications 
Tij. • X/cj. Thus we obtain a lower bound of the communication costs of any algorithm that computes these 
O(n^) multiplies. That is, on a sequential machine with fast memory of size M, the bandwidth cost is 

(^-^=^ and the latency cost is ^ ^3/2 ^ • a parallel machine with P processors and local memory 
bounded by O {^^^ ^^e bandwidth cost is Vt (^'^^ ^^e latency cost is Vt (^V^^ • 

5.2 Upper bounds: sequential case 

The communication lower bounds can be attained in the sequential case by a blocked iterative algorithm, 
presented in Algorithm [Sj The problem of computing the eigenvectors of a triangular matrix, as well as 



25 



the formulation of a blocked algorithm, appears as a special case of the computations presented in [29] : 
however, [29 does not try to pick a blocksize in order to minimize communication, nor does it analyze the 
communication complexity of the algorithm. 

For simplicity we assume the blocksize b divides n evenly, and we use the notation X[i^j] to refer to the 
b X b block of X in the i^^ block row and j^^ block column. In this section and the next we consider the 
number of flops, words moved, and messages moved separately and use the notation Aalg to denote the 
arithmetic count, Balg to denote to the bandwidth cost or word count, and Lalg to denote the latency 
cost or message count. 



Algorithm 8 Blocked Iterative TREVC 

Require: T is upper triangular, D is diagonal with diagonal entries of T; 
all matrices are blocked with blocksize b < ^jMjZ which divides n evenly 
1: for j = 1 to njb do 

2: solve T[j,i] * = j] * D[j,i] for j] {read T[i, j], write X{j,j\) 

3: for z = j — 1 down to 1 do 
4: S' = 

5: for /c = i + 1 to j do 

6: S = S^ T[i, k] * X[kJ] {read T[i, k], read X[kJ]} 

7: end for {read T[i,i], read D[j^j], write j]} 
8: end for 
9: end for 

Ensure: X is upper triangular and TX = XD 



Since each block has 0(6^) words, an upper bound on the total bandwidth cost of Algorithm [s] is given 
by the following summation and inequality: 



n/b 



BTREVcin) = ^ 



y: [o{b')]^o{b') 



./c=i+l 



< 



= ol^-^+n'+nbj. 

If each block is stored contiguously, then each time a block is read or written, the algorithm incurs a cost 
of one message. Thus an upper bound on the total latency cost of Algorithm [S] is given by 



n/b 

LrREVcip) = E 
= O 



^ [0(1)] + 0(1) 

./c=i+l 



63 62 ' 6^ 

Setting b = 9(>/M), such that b < ^/M/3^ we attain the lower bounds above (assuming > M). 



5.3 Upper Bounds: Parallel Case 

There exists a parallel algorithm, presented in Algorithm |9j requiring 0{in? /P) local memory that attains 
the communication lower bounds given above to within poly logarithmic factors. We assume a blocked 
distribution of the triangular matrix T {b = nj^fP) and the triangular matrix X will be computed and 
distributed in the same layout. We also assume that P is a perfect square. Each processor will store a block 
of T and A, three temporary blocks T', A', and 5, and one temporary vector D. That is, processor j) 
stores Ti^, A^^-, T'--, A^, Si^, and Aj- 
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The algorithm iterates on block diagonals, starting with the main block diagonal and finishing with the 
top right block, and is "right-looking" (i.e. after the blocks along a diagonal are computed, information is 
pushed up and to the right to update the trailing matrix). Blocks of T are passed right one processor column 
at a time, but blocks of X must be broadcast to all processor rows at each step. 

Algorithm 9 Parallel Algorithm PTREVC 

Require: T is upper triangular and distributed in blocked layout, D is diagonal with diagonal entries of T 
(no extra storage needed) 
for all processors (^, j) such that i > j do 
set T'.. = Tij 
set Sij = 
end for 

for all processors (j, j) do 

solve TjjXjj = XjjDjj for Xjj locally 

broadcast Xjj^Djj up processor column, store in local X' ^D' 
end for 

for A: = 1 to - 1 do 

for all processors (z, j) such that i— j>k — ldo 
k < j < \fP then 
send T[. T(^.+i (to right) 
end if 

if /c < j < then 

receive T/^_i ^ T'-- (from left) 
end if 

if i — j > k — 1 then 

update Sij = Sij + T^jX^j locally 
end if 
end for 

for all processors (i, j) such that i — j = k do 
solve Tl^Xij + Sij = XijD^j for Xij locally 
broadcast Xij up processor column, store in local X^ 
end for 
end for 

Ensure: X is upper triangular and distributed in blocked layout, TX = XD 



First we consider the arithmetic complexity. Arithmetic work occurs at lines [6j[l8j and 22 All of the for 
loops in the algorithm, except for the one starting at line [9j can be executed in parallel, so the arithmetic 
cost of Algorithm |9] is given by 




Vp \ O 




ATREVc{n,P) 



In order to compute the communication costs of the algorithm, we must make some assumptions on the 
topology of the network of the parallel machine. We will assume that the processors are arranged in a 2D 
square grid and that nearest neighbor connections exist (at least within processor rows) and each processor 
column has a binary tree of connections. In this way, adjacent processors in the same row can communicate 
at the cost of one message, and a processor can broadcast a message to all processors in its column at the 
cost of O(logP) messages. 



Communication occurs at lines |7||l2, 15, and 23 Since every message is of size 0(n^/P), the bandwidth 
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cost of Algorithm |9] is given by 
and the latency cost is 

LTREVc{n,P) = 0(logP) + V^(0(l) + 0(logP)) 

= o(VPlogP). 



6 Successive Band Reduction 

Here we discuss a different class of communication- minimizing algorithms, variants on the conventional 
reduction to tridiagonal form (for the symmetric eigenvalue problem) or bidiagonal form (for the SVD). These 
algorithms for the symmetric case were discussed at length in [7, 8 , which in turn refer back to algorithms 
originating in [4r, '42 . Our SVD algorithms are natural variations of these. These algorithms can minimize 
the number of words moved in an asymptotic sense on a sequential machine with two levels of memory 
hierarchy (say main memory and a cache of size M) when computing just the eigenvalues (respectively, 
singular values) or additionally all the eigenvectors (resp. all the left and/or right singular vectors) of a 
dense symmetric matrix (respectively, dense general matrix). Minimizing the latency cost, dealing with 
multiple levels of memory hierarchy, and minimizing both bandwidth and latency costs in the parallel case 
all remain open problems. 

The algorithm for the symmetric case is as follows. 

1. We reduce the original dense symmetric matrix A to band symmetric form, H = AQ where Q 
is orthogonal and H has bandwidth b. (Here we define b to count the number of possibly nonzero 
diagonals above the main diagonal, so for example a tridiagonal matrix has 6 = 1.) If only eigenvalues 
are desired, this step does the most arithmetic, + 0{in?) as well as 0{v? jM^I^) slow memory 
references (choosing b ~ \fM\ 

2. We reduce R to symmetric tridiagonal form T = U^HU by successively zeroing out blocks of entries 
of H and "bulge-chasing." Pseudocode for this successive band reduction is given in Algorithm \To\ 
and a scheme for choosing shapes and sizes of blocks to annihilate is discussed in detail below. This 
step does 0(n^M^/^) floating point operations and, with the proper choice of block sizes, 0(n^/M^/^) 
memory references. So while this step alone does not minimize communication, it is no worse than the 
previous step, which is all we need for the overall algorithm to attain 0(n^/M^/^) memory references. 
If eigenvectors are desired, then there is a tradeoff between the number of flops and the number of 
memory references required (this tradeoff is made explicit in Table [T]) . 

3. We find the eigenvalues of T and-if desired-its eigenvectors, using the MRRR algorithm [20^, which 
computes them all in 0{in?) flops, memory references, and space. Thus this step is much cheaper than 
the rest of the algorithm, and if only eigenvalues are desired, we are done. 

4. If eigenvectors are desired, we must multiply the transformations from steps 1 and 2 times the eigen- 
vectors from step 3, for an additional cost of 2n^ flops and 0{n^ / y/M) slow memory references. 

Altogether, if only eigenvalues are desired, this algorithm does about as much arithmetic as the conventional 
algorithm, and attains the lower bound on the number of words moved for most matrix sizes, requiring 
0(v? jM^I^^ memory references; in contrast, the conventional algorithm [1 moves Vl(v?^ words in the reduc- 
tion to tridiagonal form. However, if eigenvectors are also desired, the algorithm must do more floating point 
operations. For all matrix sizes, we can choose a band reduction scheme that will attain the asymptotic 
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lower bound for the number of words moved and increase the number of flops by only a constant factor (2 x 
or 2.6x, depending on the matrix size). 



Algorithm 10 Successive band reduction of symmetric banded matrix 
Require: A G is symmetric with bandwidth b = bi 

1: for i = 1 to s do 

2: {eliminate di diagonals from band of remaining width bi} 
3: for j = 1 to (n — bi)/ci do 

4: {eliminate q columns from bottom di diagonals} 

5: zero out diCi entries of parallelogram by orthogonal transformation (e.g. parallelograms 1 and 6 in 
Figure [8| 

6: perform two-sided symmetric update, creating bulge (e.g. Qi creates bulge parallelogram 2 in 

Figure [8| 
7: for k = 1 to {n — jci)/bi do 

8: "chase the bulge" (zero out just the diCi entries of the parallelogram) (e.g. parallelograms 2 and 

3 in Figure [8| 

9: perform two-sided symmetric update, creating bulge {(e.g. Qs creates bulge parallelogram 4 in 

Figure [§} 
10: end for 
11: end for 
12: bi^i =bi-di 
13: end for 

Ensure: A is symmetric and tridiagonal 



Table [T] compares the asymptotic costs for both arithmetic and communication of the conventional direct 
tridiagonalization with the two-step reduction approach outlined above. The analysis for Table [l] is given 
in Appendix [A| There are several parameters associated with the two-step approach. We let b denote the 
bandwidth achieved after the full-to-banded step. If we let 6 = 1 we attain the direct tridiagonalization 
algorithm. If we let b = 0(a/M), the full-to-banded step attains the communication lower bound. If 
eigenvectors are desired, then we form the orthogonal matrix Q explicitly (at a cost of flops) so that 
further transformations can be applied directly to Q to construct QU, where A = {QU)T (QU)^ . 

The set of parameters s, 6^, q, and di (for I < i < s) characterizes the scheme for reducing the banded 
matrix H to tridiagonal form via successive band reduction as described in [71 [8]. Over a sequence of s steps, 
we successively reduce the bandwidth of the matrix by annihilating sets of diagonals {di of them at the i^^ 
step). We let bi = b and 

i-l 

bi = b-'^dj. 

j=i 

We assume that we reach tridiagonal form after s steps, or that 

s 

Y,d^=b-l. 

i=l 

Because we are applying two-sided similarity transformations, we can eliminate up to bi — di columns of the 
last di subdiagonals at a time (otherwise the transformation from the right would flll in zeros created by the 
transformation from the left). We let q denote the number of columns we eliminate at a time and introduce 
the constraint 

Ci^di< bi. 

This process of annihilating a (i x c parallelogram will introduce a trapezoidal bulge in the band of the matrix 
which we will chase off the end of the band. We note that we do not chase the entire trapezoidal bulge; we 
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# flops 


# words moved 


Direct Tridiagonalization 
values 

vectors 




0(n3) 


Full-to-banded 
values 






vectors 






Banded-to-tridiagonal 






values 




o(^,.(,.+x:(i+|)..') 


vectors 


i=i 





Table 1: Asymptotic computation and communication costs of reduction to tridiagonal form. Only the 
leading term is given in the flop count. The cost of the conventional algorithm is given in the first block row. 
The cost of the two-step approach is the sum of the bottom two block rows. If eigenvectors are desired, the 
cost of each step is the sum of the two quantities given. The parameter t is defined such that n{bt-\-l) < M/4, 
or, if no such t exists, let t = s -\- 1. 




Figure 8: First pass of reducing a symmetric matrix from band to tridiagonal form. The numbers represent 
the order in which the parallelograms are eliminated, and Qi represents the orthogonal transformation which 
is applied to the rows and columns to eliminate the i*^ parallelogram. 
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eliminate only the first q columns (another d x c parallelogram) as that is sufficient for the next sweep not 
to fill in any more diagonals. These parameters and the bulge-chasing process are illustrated in Figure [8] 

We now discuss two different schemes (i.e. choices of {q} and {di}) for reducing the banded matrix to 
tridiagonal formj^ Consider choosing di = b — 1 and Ci = 1, so that 5 = 1, as in the parallel algorithm of [33]. 
Then the communication cost of the banded-to- tridiagonal reduction is 0{hn^). If we choose h = ai^/M for 
some constant ai < 1/2, then four b x b blocks can fit into fast memory at a time, and by results in [16], 

the communication cost of reduction from full to banded form is O (^:^^' The communication cost of the 
reduction from banded to tridiagonal (choosing di = b — 1) is 0(in?y/M)^ and so the total communication 



cost for reduction from full to tridiagonal form is O + ti^a/M^ • This algorithm attains the lower bound 

if the ffist term dominates the second, requiring n = Vt{M). Thus, for sufficiently large matrices, this scheme 
attains the communication lower bound and does not increase the leading term of the flop count from the 
conventional method if only eigenvalues are desired. 

If eigenvectors are desired, then the number of flops required to compute the orthogonal matrix QU 
in the banded-to-tridiagonal step is 2n^ and the number of words moved in these updates is 0(n^ /y/M). 
If T = VhV^ is the eigendecomposition of the tridiagonal matrix, then A = {QUV) h.{QUV)^ is the 
eigendecomposition of A. Thus, if the eigenvectors of A are desired, we must also multiply the explicit 
orthogonal matrix QU times the eigenvector matrix V, which costs an extra 2n^ flops and 0{n^ / \/M) words 
moved. Thus the communication lower bound is still attained, and the total flop count is increased to 
as compared to in the direct tridiagonalization. 

Note that if n < M, the v?\fM term dominates the communication cost and prevents the scheme above 
from attaining the lower bound. Consider values of n that are larger than \fM (so the matrix does not fit 
into fast memory) but smaller than M (so one or more diagonals do fit into fast memory). In this case, we 
reduce the banded matrix to tridiagonal in two steps (s = 2). First, choose d\ = b — and ci = 0^2"^, 

where a2 < 1/4 is a constant chosen so that (^2^ is an integer. This implies that 62 = <^2^^ so we choose 
d2 = <^2^ — I and C2 = 1, reducing to tridiagonal after two steps. 

Suppose only eigenvalues are desired. Note that 77,62 = <^2M^ so after the first pass, the band fits in fast 
memory. Note that we choose a2 < 1/4 so that the band itself fits in a quarter of fast memory, and another 
quarter of fast memory is available for bulges and other temporary storage. If eigenvectors are desired, then 
the other half of memory can be used in the computation of QU. Thus, the number of words moved in the 
reduction is O (77,62 + (l + <^i/ci)n^). If we choose b = OL\\fM for some constant ol\ < 1/2, then ^ = 6 (^^^^ 5 
and the total communication cost for reduction from full to tridiagonal form is 

O ( ^ + + M 



and since 71^ > M, the first term dominates and the scheme attains the lower bound. Again, the change in 
the number of flops in the reduction compared to the conventional algorithm is a lower order term. 

If eigenvectors are desired, then the extra flops required to update the orthogonal matrix Q to compute 
QU is 2(^ + ^)^^- Since di < bi^ this quantity is bounded by 477,^. Since we also have to multiply the 
updated orthogonal matrix QU times the eigenvector matrix V of the tridiagonal T (for a cost of another 
2n^ flops), this increases the total flop count to no more than ^77,^. The extra communication cost of these 

updates is O (^^^^ words, thus maintaining the asymptotic lower bound 

-•^As shown in Figure [sj for all choices of {q} and {di}, we chase all the bulges created by the first parallelogram (numbered 
1) before annihilating the adjacent one (numbered 6). It is possible to reorder the operations that do not overlap (say eliminate 
parallelogram 6 before eliminating the bulge parallelogram 4), and such a reordering may reduce memory traffic. However, for 
our purposes, it will be sufficient to chase one bulge at a time. 

^Updating the matrix Q must be done in a careful way to attain this communication cost. See Appendix^for the algorithm. 
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Recovering eigenvector matrix 






vectors 






Total costs 






values only 






values and vectors 


fn3 


0(A) 



Table 2: Asymptotic computation and communication costs of reduction to tridiagonal form in the case of 
\/M < n < M. The parameters ai < 1/2,^2 < 1/4 are constants. Only the leading term is given in the 
flop count. The costs correspond to the specific choices of 6, di^ ci, and C2 listed in the first column. 



We may unify the two schemes for the two relative sizes of n and M as follows. Choose 

hi = aiVu 
b2 

di = hi - 62 
ci = h2 



max < 0^2 — , 1 r 
I ^ J 



where ai < 1/2, 0^2 < 1/4 are constants, and, if necessary (62 > 1), 

= h2 - 1 
C2 = 1. 

In this way we attain the communication lower bound whether or not eigenvectors are desired. If only 
eigenvalues are desired, then the two-step approach does + O(n^) arithmetic which matches the cost of 
direct tridiagonalization. If eigenvectors are also desired, the number of fiops required by the SBR approach 
is bounded by either y or depending on the number of passes used to reduce the band to tridiagonal, 
as compared to the fiops required by the conventional algorithm. 



7 Conclusions 

We have presented numerically stable sequential and parallel algorithms for computing eigenvalues and 
eigenvectors or the SVD, that also attain known lower bounds on communication, i.e. the number of words 
moved and the number of messages. 

The first class of algorithms, which do randomized spectral divide- and- conquer^ do several times as much 
arithmetic as the conventional algorithms, and are shown to work with high probability. Depending on 
the problem, they either return the generalized Schur form (for regular pencils), the Schur form (for single 
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Table 3: Asymptotic computation and communication costs of reduction to tridiagonal form in the case of 
n > M. The constant ai is chosen to be smaller than 1/2. Only the leading term is given in the flop count. 
The costs correspond to the speciflc choices of 6, di, and ci listed in the flrst column. 

matrices), or the SVD. But in the case of nonsymmetric matrices or regular pencils whose e-pseudo-spectra 
include one or more large connected components in the complex plane containing very many eigenvalues, it 
is possible that the algorithm will only return convex enclosures of these connected components. In contrast, 
the conventional algorithm (Hessenberg QR iteration) would return eigenvalue approximations sampled from 
these components. Which kind of answer is more useful is application dependent. For symmetric matrices 
and the SVD, this issue does not arise, and with high probability the answer is always a list of eigenvalues. 

A remaining open problem is to strengthen the probabilistic analysis of Theorem |2.1[ which we believe 
underestimates the likelihood of convergence of our algorithm. 

Given the Schur form of a nonsymmetric matrix or matrix pencil, we can also compute the eigenvectors 
in a communication-optimal fashion. 

Our last class of algorithms uses a technique called successive hand reduction (SBR)^ which applies 
only to the symmetric eigenvalue problem and SVD. SBR deterministically reduces a symmetric matrix to 
tridiagonal form (or a general matrix to bidiagonal form) after which a symmetric tridiagonal eigensolver 
(resp. bidiagonal SVD) running in O(n^) time is used to complete the problem. With appropriate choices 
of parameters, sequential SBR minimizes the number of words moved between 2 levels of memory hierarchy; 
minimizing the number of words moved in the parallel case, or minimizing the number of messages in 
either the sequential or parallel cases are open problems. SBR on symmetric matrices only does double the 
arithmetic operations of the conventional, non-communication-avoiding algorithm when the matrix is large 
enough (or 2.6x the arithmetic when the matrix is too large to flt in fast memory, but small enough for one 
row or column to flt). 

We close with several more open problems. First, we would like to extend our communication lower 
bounds so that they are not algorithm-speciflc, but algorithm-independent, i.e. apply for any algorithm for 
solving the eigenvalue problem or SVD. Second, we would like to do as little extra arithmetic as possible 
while still minimizing communication. Third, we would like to implement the algorithms discussed here and 
determine under which circumstances they are faster than conventional algorithms. 
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A Analysis for Table [l] 
A.l Direct Tridiagonalization 

We now provide the analysis for the communication costs presented in Table [l] The flop counts for direct 
tridiagonalization are well-known, see [10 , for example. Because direct tridiagonalization requires n sym- 
metric matrix- vector products (each requires moving O(n^) data to access the matrix), the total number 
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of words moved in reducing the symmetric matrix to tridiagonal is O(n^). After this step, the orthogonal 
matrix whose similarity transformation tridiagonalizes A is not stored explicitly, but as a set of Householder 
vectors]^ Applying these Householder vectors to the eigenvector matrix of the tridiagonal matrix to recover 

the eigenvectors of A can be done in the usual blocked way and thus costs 2n^ and O (^^^^ extra memory 
references if the blocksize is chosen to be 0(a/M)- 



A. 2 Counting Lemmas 

To simplify the analysis for the rest of the table, we point out two useful facts. We also note that blocking 
Householder updates does introduce extra flops, but because the number of extra flops is a lower order term, 
we ignore them here and focus on BLAS-2 updates. 

Lemma A.l. The leading term of the number of flops required to apply a Householder transformation of a 
vector u with h nonzeros to c columns (or rows if the applying from the right) of a matrix A is 4hc. 

Proof. Let A be m x c (some subset of the columns of A). Applying the transformation associated with 
Householder vector u is equivalent to overwriting the matrix A with A — tu{u^ A). Since u has h nonzeros, the 
cost of computing A is 2hc which yields a dense row vector of length c. Computing tu costs h multiplies. 
Finally, he entries of A must be updated with one multiplication and one subtraction each. Thus, the total 
cost of the update is Ahc -\- h. □ 

Lemma A. 2. The cost of applying a Householder transformation of a vector with h nonzeros from the left 
and right to a symmetric matrix is the same as applying the Householder transformation from the left to a 
nonsymmetric matrix of the same size. 

Proof. Let Ahe nx n and symmetric, and let be a Householder vector with h nonzeros. As pointed out in 
[21], the performing the following three computations overwrites the matrix A with {I — ruu^)A{I — ruu^)^ : 

y ^ Au 

A ^ A - uv^ - vu^ 

The first computation costs 2hn flops, and the second line costs 4/i + 1. Since each of the matrices uv^ 
and vu^ have hn nonzeros, each entry of A which corresponds to a nonzero in one of the matrices must be 
updated with a multiplication and a subtraction. However, since A is symmetric, we need to update only 
half of its entries, so the cost of the third line is 2hn. Thus, the total cost (ignoring lower order terms) is 
Ahn which is the same as the cost of applying the Householder transformation from one side to an n x n 
nonsymmetric matrix. □ 



A. 3 Reduction from Full to Banded 

The second row of Table [l] corresponds to the first step of the SBR approach, reducing a full symmetric matrix 
to banded form by a sequence of QR factorizations on panels followed by two-sided updates. Algorithm [TT] 
describes the reduction of a full symmetric matrix to banded form. Here TSQR refers to the Tall-Skinny 
QR decomposition described in [16], which returns Q in a factored form whose details do not concern us 
here. The multiplication by and Q in the next two lines assumes Q is in this factored form and exploits 



the symmetry of A. See [34 for the details of a multicore implementation of this algorithm. By Lemma A. 2 



the flop costs of the two-sided updates are the same as a one-sided update to a non-symmetric matrix. At 

2^ 



the k^^ step of the reduction, the QR is performed on a panel of size (n — kb) x 6, yielding (n — kb)b — \b'^ 



^Note that the exphcit orthogonal matrix can be generated in flops if desired. 
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Householder entries which are then apphed to n — kb -\- b rows and columns. Thus, by Lemma |A.1[ the 
arithmetic cost of the reduction is roughly 

n/b-2 1 / 1 \ 

V 2(n - kb)b^ - -6^ + 4 ( (n - kb)b - -b^ ]{n-kb^b) = -n^ + 0{bn^). 
k=i 3 V 2 y 3 

By results for TSQR and applying Q and in Appendix B of [16 , all of these flops may be organized in a 
blocked fashion to achieve a communication cost of O (^x) ^^^^^5 where b < \fM /2. Note that b < \fM /2 
implies that 46^ < M, or four 6x6 blocks fit into fast memory at the same time. 

Algorithm 11 Reduction of A = from dense to band form with bandwidth b 

Require: A G is symmetric, and assume that b divides n for simplicity 

1: for z = lton — 26 + 1 step b do 
2: [Q, R\ = TSQR{A{i H- 6 : n, z : i + 6 - 1)) 
3: A{i^b: n, :) = ■ A{i ^ b : n, :) 
4: A{:, i^b :n) = i ^ b : n) ■ Q 
5: end for 

Ensure: A is symmetric and banded with bandwidth b 



If eigenvectors are desired, the orthogonal matrix Q that reduces the full matrix to banded form via 
similarity transformation is formed explicitly. In this way, the many small Householder transformations of 
the banded-to-tridiagonal step can be applied to Q as they are computed and then discarded. The arithmetic 
cost of forming Q explicitly is that of applying the Householder transformations to the identity matrix in 
reverse order as they were computed]^ Since the number of Householder entries used to reduce the second- 
to-last column panel is roughly and the number of rows on which these transformations operate is 6, 
the transformations need be applied only to the last b columns of the identity matrix. The next set of 
transformations, computed to reduce the third-to-last column panel of A, operate on the last 2b rows. Thus, 
they need to be applied to only the last 2b columns of the identity matrix. Since the number of Householder 
entries used to reduce the column panel is roughly (n — ib)b — and the number of columns to be 



updated is (n — z6), by Lemma A.l, the total number of flops for constructing Q explicitly is given by 



n/h—l , X 

^ 4 f (n - ih)h - -b^ ] (n - ib) = -n^ + 0{bn^). 

i=l ^ ^ 



Again, as shown in [16], the application of these Householder entries can be organized so that the commu- 
nication cost is O (^x) words where b < V^/2. 

A. 4 Reduction from Banded to Tridiagonal 

Some analysis for the third row of Table [l] appears in [7J for the flop counts in the case q = 1 for each 
i. We reproduce their results for general choices of {q} and also provide analysis for the cost of the data 
movement. We will re-use the terminology from [7 and refer the reader to Figure 2(b) in that paper for a 
detailed illustration of one bulge-chase. 

Bulge-chasing can be decomposed into four operations: "QR" of a (d + c) x c matrix, "Pre" -multiplication 
of an orthogonal matrix which updates d -\- c rows, a two-sided "Sym"mmetric update of {d -\- c) x {d -\- 
c) submatrix, and "Post" -multiplication of the transpose of the orthogonal matrix which updates d -\- c 
columns. We can count the arithmetic cost in terms of applying Householder vectors to sets of rows/columns 



^This method mirrors the computation performed in LAPACK's xORGQR for generating the corresponding orthogonal matrix 
from a set of Householder vectors. 
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(using Lemmas A.l and A.2). Consider eliminating column k of the parallelogram. In the QR step, the 
associated Householder transformation must be applied to the remaining c — k columns. In the Pre step, 
the transformation is applied to b — c columns. In the Sym step, the two-sided update is performed on 
a {d -\- c) X {d -\- c) submatrix. Finally, in the Post step, the transformation is applied from the right to 
b — {c — k) rows. Since every Householder vector used to eliminate the parallelogram has d nonzeros, the 
cost of eliminating each column of the parallelogram is 4:d{2b + d), and the cost of eliminating the entire 
parallelogram (c columns) is 8bcd -\- Acd'^ . 

Bulge-chasing pushes the bulge b columns down the band each time. Thus, the number of flops required 
to eliminate a set of d diagonals is about 



n/c 



^ {8bcd + 4cd^) "^-^ = {M + 2^)n^ 



Since d < b^ this cost is bounded above by 6dn'^. Taking s steps to eliminate all the diagonals, we see that 
the total arithmetic cost of the banded-to-tridiagonal reduction (ignoring lower order terms) is 



^6di 



We now consider the data movement. First, suppose the band is small enough to fit into memory into 
one fourth of memory; that is, n{b-\- 1) < M/4 (we pick the fraction 1/4 so that the band fits in a quarter of 
the memory, the extra storage required for bulge chasing fits in a quarter of the memory, and the other half 
of memory can be reserved for the vector updating process discussed below). Then the communication cost 
of the reduction is simply reading in the band: nb words. Suppose the band is too large to fit into memory. 
In this case, since we chase each bulge all the way down the band, we will obtain very little locality. Even 
though we eliminate a bulge as soon as it's been created, we ignore the fact that some data (including the 
bulge itself) fits in memory and will be reused (it is a lower order term anyway). Consider the number of 
words accessed to chase one bulge: QR accesses + words. Pre accesses {b — c)(c + d) words, Sym 
accesses ^(c + d^ words, and Post accesses b{c + d) — words. The total number of words accessed for 
one bulge-chase is then 2b{c + d) + ^(c + d)^ — c^. Using the same summation as before, the communication 
cost of chasing all the bulges from eliminating d diagonals (ignoring lower order terms) is 

Since c + d < 6, and by ignoring the negative term, we obtain a communication cost for eliminating d 
diagonals of O ((l + ^) v?). 

In order to obtain the communication cost for the entire banded-to-tridiagonal step, we sum up the costs 
of all the reductions of {di} diagonals until the band fits entirely in memory. That is, the number of words 
moved in the banded-to-tridiagonal step is 




where n{bt + 1) < M. Note that if no such t exists (in the case 2n > M), the number of words is given by 




where bs-\-i = 1. 
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If eigenvectors are desired, each of the orthogonal transformations used to ehminate a parahelogram or 
chase a bulge must also be applied to the orthogonal matrix Q which is formed explicitly in the full-to-banded 
step. Note that the transformations must be applied to Q from the right (we are forming the matrix QU 
such that (QU)^ AQU = T is tridiagonal) . However, these updates are one-sided and since many of the 
orthogonal transformations work on mutually disjoint sets of columns, we can reorder the updates to Q. 
The restrictions on this reordering are made explicit in ^ , and we use the same reordering proposed in that 
paper. Note that to be able to reorder the updates on the eigenvector matrix Q, we must temporarily store 
a set of Householder entries which are computed in the band reduction process. 

Suppose we save up all the Householder entries from eliminating k parallelograms and chasing their 
associated bulges completely off the band. Since every Householder transformation consists of cd Householder 
entries, saving up the transformations after eliminating k parallelograms requires O (^kcd^) words of extra 
memory. 

Following the notation of [9 , we number the Householder transformations by the ordered pair (^, j) 
where i is the number of the initial parallelogram annihilated and j is the number of the bulge chased by 
the transformation (so 1 < i < k and 1 < j < n/b). For example, in Figure [sj Qi would we numbered 
(1, 1), Qs would be numbered (1,3), and the orthogonal transformation annihilating parallelogram 6 would 
be numbered (2,1). We will group these Householder transformations not by their initial parallelogram 
number but by their bulge number (the second index in the ordered pair). As argued in 0, we can apply 
these updates in groups in reverse order, so that all (i,4) transformations will be applied before any (i,3) 
transformations. As long as the order of the updates within the groups is done in increasing order with respect 
to the initial parallelogram number (e.g. (3,4) before (4,4)), we maintain correctness of the algorithm. 

Consider the group of transformations with bulge number j. The transformation (1, j) eliminates a 
parallelogram of size d x c and therefore updates d -\- c columns of Q. Since the transformation (2, j) 
eliminates the adjacent parallelogram, it updates an overlapping set of columns. The first (leftmost) d 
columns are affected by (1, j), but the last (rightmost) c columns are disjoint. Since all of the parallelograms 
are adjacent, the total number of columns updated by the k transformations with bulge number j is d-\- kc. 

We choose k such that kc = Q{^/M). Note that we may always choose k in this way if we assume 
the initial bandwidth is chosen to be 6 = 0(a/M) since c < b (this is a safe assumption because choosing 
a larger bandwidth does not reduce memory traffic in the full-to-banded step). This implies that since 
d < b^ the extra memory required is 0(n\/M) words, each group of Householder transformations consists of 
kcd = 0(M) words, and that the number of columns affected by each group of Householder transformations 
is 0{y/M). 

We now describe the updates to Q. We assume that all of the Householder entries for k sweeps of the 
reduction process have been stored in slow memory and that Q resides in slow memoryj^ The updating 



process is given as Algorithm 12 



Let J = 0{n/b) be the largest bulge number. We update Q one row panel at a time in reverse order 
by the bulge number j and then in increasing order by sweep number i. Since the size of each group 
of transformations is 0(M), and the number of rows and columns of Q accessed at each iteration are 
both 0(a/M), all of the necessary information resides in fast memory during the application of the (i, j) 
transformations. The columns effected by transformations (:, j) and (:, j — 1) may overlap, but since the 
rightmost column effected by block j is to the right of the rightmost column effected by j — 1 for each j, 
applying the blocks of transformations in this order ensures that the entries of each row panel of Q are read 
from and written to slow memory no more than once. However, each Householder entry in a transformation 
is read once for each row panel updated. Thus, the number of words accessed to/from slow memory for the 
vector update after k sweeps, under the choices kc = 6(a/M) and B = 6(\/M) is 

n ,n 
B 



(^kcd^^nB^ =0(n^). 



^In the communication cost for the reduction of the band to tridiagonal form, we assumed that once the band fits in memory- 
it may stay there until the algorithm completes. If eigenvectors are desired, then this update process must occur with the band 
itself residing in fast memory. We choose constants such that half of fast memory is reserved for storing the band and the other 
half of fast memory may be used for the updating process. 
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Algorithm 12 Updating eigenvector matrix Q with Householder transformations from k sweeps 



Require: Q G ]R^><^ is orthogonal, transformations U{i^j) have been computed for 1 < i < k and 1 < j < J, 
and assume that B divides n for simplicity 
1: for I = 1 to n — B step B do 
2: for j = J down to 1 do 
3: read U{i,j) for 1 < i < k 

4: read columns of row panel Q{i : i -\- B — 1^ :) (not already in fast memory) effected by [7(1 : /c, j) 
5: for i = 1 to do 
6: apply 
7: end for 
8: end for 
9: end for 
Ensure: Q is orthogonal 



The number of times this updating process must be done to eliminate d diagonals is and thus the total 
communication cost is O (^:^^' If ^ is the number of steps taken to reach tridiagonal form, the total extra 
number of words moved in order to update the vector matrix in the banded-to-tridiagonal step is 

O 



After the matrices QU and T are computed, we use MRRR to find the eigenvalues and (if desired) the 
eigenvectors of T. In order to compute the eigenvectors of A we must multiply the matrix QU times the 
eigenvector matrix of T, and doing so costs 2n^ fiops and 0(n^ / y/M) memory references. 

B SVD via SBR 

The situation for the SVD is analogous to the symmetric eigenvalue problem using symmetric band reduction: 

1. We reduce the matrix A to upper triangular band form S = U^AV with bandwidth b ^ M^/^, and U 
and V are orthogonal. 

2. We reduce S to upper bidiagonal form B = W^SX, where W and X are orthogonal. We again do 
this by successively zeroing out blocks of entries of S of carefully chosen sizes. 

3. We find the singular values of 5, and its left and/or right singular vectors if desired. There are a number 
of algorithms that can compute just the singular values in O(n^) fiops and memory references, in which 
case we are done. But computing the singular vectors as well is more complicated: Extending MRRR 
from the symmetric tridiagonal eigenproblem to the bidiagonal SVD has been an open problem for 
many years, but an extension was recently announced [IT, whose existence we assume here. 

4. If singular vectors are desired, we must multiply the transformations from steps 1 and 2 times the 
singular vectors from step 3, for an additional cost of O(n^) fiops and 0(n^/\/M) slow memory 
references. 



The algorithm and analysis for the SVD are entirely analogous to the symmetric case. Algorithm 13 
shows the first part of the reduction to bidiagonal form. 

The first pass of the second part of the SVD algorithm, reducing the upper triangular band matrix to 
bidiagonal form, is shown in Figure |9] Table [4] provides the asymptotic computation and communication 
complexity of computing the SVD for the case when only singular values are desired and the case when both 
left and right singular vectors as well as the singular values are desired. The analysis for Table [4] is analogous 
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Algorithm 13 Reduction of A from dense to upper triangular band form with bandwidth b 



Require: A G R^^^, and assume that b divides n for simphcity 
1: for z = lton — 26 + 1 step b do 
2: [Qc, Rc] = TSQR{A{i :n,i:i^b-l)) 
3: A{i : n, :) = • A{i : n, :) 
4: [Q^, i?^] = TSQR{{A{i : z + 6 - 1, i + 6 : n))^) 
5: i^b : n) = A{:, i^b:n) -Qr 

6: end for 

Ensure: A is banded upper triangular with bandwidth b 




Figure 9: First pass of reducing a triangular matrix from banded to bidiagonal form 
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# flops 


# words moved 


Direct Bidiagonalization 
values 

vectors 


3 


( 



(vm) 




Full-to-banded 
values 

vectors 






Banded-to-tridiagonal 
values 

vectors 


i=l 


(^nbt + ^ 




) 


Recovering singular vectors 

vectors 


4n^ 


O 


/ n3 \ 





Table 4: Asymptotic computation and communication costs of computing the SVD of a square matrix. Only 
the leading term is given in the flop count. The cost of the conventional algorithm is given in the first block 
row. The cost of the two-step approach is the sum of the bottom three block rows. If both sets of singular 
vectors are desired, the cost of each step is the sum of the two quantities given. The parameter t is defined 
such that n{bt + 1) < M/4, or, if no such t exists, let t = s + 1. 

to the symmetric case. Except in the case of the reduction from banded to bidiagonal form, the leading term 
in the flop count at each step doubles as compared to the symmetric eigenproblem. 

In order to attain the communication lower bound, we use the same scheme for choosing {q}, and 
{di} as in the symmetric case. That is, for large matrices (n > M), we reduce the band to bidiagonal in one 
sweep, and for smaller matrices {^/M < n < M), we use the first sweep to reduce the size of the band so that 
it fits in fast memory and the second sweep to reduce to bidiagonal form. As in the case of the symmetric 
eigenproblem, if only singular values are desired, the leading term in the flop count is the same as for the 
conventional, non- communication- avoiding algorithm. If both left and right singular vectors are desired, for 
large matrices, we increase the leading term in the flop count by a factor of 2, and for smaller matrices, we 
increase the leading term in the flop count by a factor of 2.6. The conventional approach costs flops if 
only singular values are desired and flops if all vectors and values are desired. Computing the SVD 

via SBR costs flops if only singular values are required and either or flops if all vectors and 
values are desired. 
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